Unit 2 - Foundation of Inference


The trinity of statistical inference: estimation, confidence intervals and testing.

  • Estimator: a single value whose performance can be measured by consistency, asymptotic normality, bias, variance and quadratic risk.
  • Confidence intervals provide "error bars" around estimators. Their size depends on the confidence level.
  • Hypothesis testing: we ask a yes/no question about an unknown parameter. Tests are characterized by hypotheses, level, power, a test statistic and a rejection region. Under the null hypothesis, the value of the unknown parameter becomes known (no need for plug-in).

Statistical model

Formal definition:

Let the observed outcome of a statistical experiment be a sample $X_1, X_2, \ldots, X_n$ of $n$ i.i.d. random variables in some measurable space $E$ (usually $E \subseteq \mathbb{R}$) and denote by $\mathbf{P}$ their common distribution. A statistical model associated to that statistical experiment is a pair:

$$\left(E, (\mathbf{P}_\theta)_{\theta \in \Theta}\right)$$

where:

  • $E$ is called the sample space
  • $(\mathbf{P}_\theta)_{\theta \in \Theta}$ is a family of probability measures on $E$
  • $\Theta$ is any set, called the parameter set.

For example, the statistical model of the Bernoulli distribution is $(\{0,1\}, (\mathsf{Ber}(p))_{p \in (0,1)})$.

Parametric, nonparametric and semiparametric models

Usually, we will assume that the statistical model is well specified, i.e., defined such that $\mathbf{P} = \mathbf{P}_\theta$ for some $\theta \in \Theta$. This particular $\theta$ is called the true parameter, and is unknown: the aim of the statistical experiment is to estimate $\theta$, or to check its properties when they have a special meaning.

  • if $\Theta \subseteq \mathbb{R}^d$ for some $d \geq 1$, the model is called parametric
  • if $\Theta$ is infinite-dimensional, the model is called nonparametric
  • if $\Theta = \Theta_1 \times \Theta_2$, where $\Theta_1$ is finite-dimensional and $\Theta_2$ is infinite-dimensional, the model is called semiparametric.

Identifiability

The parameter $\theta$ is called identifiable iff the map $\theta \in \Theta \mapsto \mathbf{P}_\theta$ is injective, i.e.:

$$\theta \neq \theta' \Rightarrow \mathbf{P}_\theta \neq \mathbf{P}_{\theta'}$$

or equivalently:

$$\mathbf{P}_\theta = \mathbf{P}_{\theta'} \Rightarrow \theta = \theta'$$

Estimation

A statistic is any measurable function of the sample.

An estimator of $\theta$ is a statistic $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ whose expression does not depend on $\theta$.

  • An estimator $\hat\theta_n$ of $\theta$ is weakly (resp. strongly) consistent if $\hat\theta_n \xrightarrow[n\to\infty]{\mathbf{P}\ (\text{resp. a.s.})} \theta$
  • An estimator $\hat\theta_n$ of $\theta$ is asymptotically normal if $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow[n\to\infty]{(d)} \mathcal{N}(0, \sigma^2)$. The quantity $\sigma^2$ is then called the asymptotic variance of $\hat\theta_n$.

Bias of an estimator $\hat\theta_n$ of $\theta$:

$$\mathrm{bias}(\hat\theta_n) = \mathbb{E}[\hat\theta_n] - \theta$$

If $\mathrm{bias}(\hat\theta_n) = 0$, we say that $\hat\theta_n$ is unbiased.

We want estimators to have low bias and low variance at the same time.

The risk (or quadratic risk) of an estimator $\hat\theta_n \in \mathbb{R}$ is

$$R(\hat\theta_n) = \mathbb{E}[|\hat\theta_n - \theta|^2]$$

which decomposes as

$$\text{Quadratic Risk} = \text{Variance} + \text{Bias}^2$$
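To see why, add and subtract $\mathbb{E}[\hat\theta_n]$ inside the square; the cross term vanishes because $\mathbb{E}\big[\hat\theta_n - \mathbb{E}[\hat\theta_n]\big] = 0$:

$$\mathbb{E}\big[(\hat\theta_n - \theta)^2\big] = \mathbb{E}\big[(\hat\theta_n - \mathbb{E}[\hat\theta_n])^2\big] + \big(\mathbb{E}[\hat\theta_n] - \theta\big)^2 = \operatorname{Var}(\hat\theta_n) + \mathrm{bias}(\hat\theta_n)^2$$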

For example, for the Bernoulli model $(\{0,1\}, (\mathsf{Ber}(p))_{p \in (0,1)})$, using $\hat{p}_n = \overline{X}_n$ as an estimator for $p$: this estimator is unbiased and consistent, and its quadratic risk $p(1-p)/n$ tends to 0 as the sample size $n \to \infty$.
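These properties can be checked numerically. A minimal simulation sketch (the true $p$, sample size, and number of repetitions below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.3, 1000, 20000  # arbitrary true parameter and sizes

# Draw `reps` independent samples of size n and compute each sample mean.
X = rng.binomial(1, p, size=(reps, n))
p_hat = X.mean(axis=1)

bias = p_hat.mean() - p              # empirical bias: close to 0 (unbiased)
risk = np.mean((p_hat - p) ** 2)     # empirical quadratic risk: close to p(1-p)/n
print(bias, risk, p * (1 - p) / n)
```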

Confidence Intervals

Let $\left(E, (\mathbf{P}_\theta)_{\theta \in \Theta}\right)$ be a statistical model based on observations $X_1, X_2, \ldots, X_n$, and assume $\Theta \subseteq \mathbb{R}$. Let $\alpha \in (0,1)$.

  • Confidence interval (C.I.) of level $1-\alpha$ for $\theta$: any random interval $\mathcal{I}$ (depending on $X_1, X_2, \ldots, X_n$) whose boundaries do not depend on $\theta$ and such that:

    $$\mathbf{P}_\theta[\mathcal{I} \ni \theta] \geq 1-\alpha, \quad \forall \theta \in \Theta$$

  • C.I. of asymptotic level $1-\alpha$ for $\theta$: any random interval $\mathcal{I}$ whose boundaries do not depend on $\theta$ and such that:

    $$\lim_{n\to\infty} \mathbf{P}_\theta[\mathcal{I} \ni \theta] \geq 1-\alpha, \quad \forall \theta \in \Theta$$

Be aware that the requirement is $\mathbf{P} \geq 1-\alpha$, not $\mathbf{P} = 1-\alpha$.

For example, for the Bernoulli model $(\{0,1\}, (\mathsf{Ber}(p))_{p \in (0,1)})$, using $\hat{p}_n = \overline{X}_n$ as an estimator for $p$, the CLT gives:

$$\sqrt{n}\,\frac{\overline{X}_n - p}{\sqrt{p(1-p)}} \xrightarrow[n\to\infty]{(d)} \mathcal{N}(0,1)$$

For a fixed $\alpha \in (0,1)$, if $q_{\alpha/2}$ is the $(1-\alpha/2)$-quantile of $\mathcal{N}(0,1)$, then with probability $\simeq 1-\alpha$ (if $n$ is large enough),

$$\overline{X}_n \in \left[p - \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}},\ p + \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}}\right]$$

It yields:

$$\lim_{n\to\infty} \mathbf{P}\left(\left[\overline{X}_n - \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}},\ \overline{X}_n + \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}}\right] \ni p\right) = 1-\alpha.$$

But this is not a confidence interval, because it depends on $p$! Three solutions are presented below.

Conservative bound

Since $p(1-p) \leq 1/4$, we can replace $\sqrt{p(1-p)}$ by its worst-case value $1/2$:

$$\mathcal{I}_{\textsf{conserv}} = \left[\overline{X}_n - \frac{q_{\alpha/2}}{2\sqrt{n}},\ \overline{X}_n + \frac{q_{\alpha/2}}{2\sqrt{n}}\right]$$

Indeed: $\lim_{n\to\infty} \mathbf{P}(\mathcal{I}_{\textsf{conserv}} \ni p) \geq 1-\alpha$.
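As a sketch, the conservative interval is straightforward to compute; here $q = 1.96$ is the $(1-\alpha/2)$-quantile for $\alpha = 0.05$, and the observed counts are made up for illustration:

```python
import math

def conservative_ci(x_bar, n, q=1.96):
    """Conservative asymptotic CI: bound sqrt(p(1-p)) by its maximum 1/2."""
    half = q / (2 * math.sqrt(n))
    return (x_bar - half, x_bar + half)

# e.g. 62 successes out of n = 100 observations
lo, hi = conservative_ci(0.62, 100)
print(lo, hi)  # (0.522, 0.718)
```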

Solving the (quadratic) equation for p

From

$$\overline{X}_n - \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}} \leq p \leq \overline{X}_n + \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}}$$

we can get

$$(p - \overline{X}_n)^2 \leq \frac{q_{\alpha/2}^2\, p(1-p)}{n}$$

We need to find the roots $p_1 < p_2$ of

$$\left(1 + \frac{q_{\alpha/2}^2}{n}\right)p^2 - \left(2\overline{X}_n + \frac{q_{\alpha/2}^2}{n}\right)p + \overline{X}_n^2 = 0$$

This leads to $\mathcal{I}_{\textsf{solve}} = [p_1, p_2]$, such that $\lim_{n\to\infty} \mathbf{P}(\mathcal{I}_{\textsf{solve}} \ni p) = 1-\alpha$.
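A sketch of the "solve" approach, using the quadratic coefficients above (again with $q = 1.96$ for $\alpha = 0.05$ and made-up data):

```python
import math

def solve_ci(x_bar, n, q=1.96):
    # Roots of (1 + q^2/n) p^2 - (2 x_bar + q^2/n) p + x_bar^2 = 0
    a = 1 + q**2 / n
    b = -(2 * x_bar + q**2 / n)
    c = x_bar**2
    disc = math.sqrt(b**2 - 4 * a * c)  # discriminant is nonnegative here
    return ((-b - disc) / (2 * a), (-b + disc) / (2 * a))

# e.g. 62 successes out of n = 100 observations
p1, p2 = solve_ci(0.62, 100)
print(p1, p2)
```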

Plug-in

This method uses the estimate $\hat{p}$ to approximate the variance.

By the LLN, $\hat{p} = \overline{X}_n \xrightarrow[n\to\infty]{\mathbf{P},\ \text{a.s.}} p$, and by Slutsky's theorem:

$$\sqrt{n}\,\frac{\overline{X}_n - p}{\sqrt{\hat{p}(1-\hat{p})}} \xrightarrow[n\to\infty]{(d)} \mathcal{N}(0,1)$$

This leads to:

$$\mathcal{I}_{\textsf{plug-in}} = \left[\overline{X}_n - \frac{q_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}},\ \overline{X}_n + \frac{q_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}\right]$$

such that $\lim_{n\to\infty} \mathbf{P}(\mathcal{I}_{\textsf{plug-in}} \ni p) = 1-\alpha$.
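And a sketch of the plug-in interval (same assumed $q = 1.96$ and made-up data as before):

```python
import math

def plugin_ci(x_bar, n, q=1.96):
    # Standard error with the unknown p replaced by its estimate x_bar.
    se = math.sqrt(x_bar * (1 - x_bar) / n)
    return (x_bar - q * se, x_bar + q * se)

# e.g. 62 successes out of n = 100 observations
lo, hi = plugin_ci(0.62, 100)
print(lo, hi)
```

Note that the plug-in interval is slightly narrower than the conservative one whenever $\hat{p} \neq 1/2$.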

Meaning of confidence interval

There is a frequentist interpretation:

A 95% C.I. means: if we were to repeat the experiment many times, the true parameter $\theta$ would be in the resulting confidence interval about 95% of the time.

It is wrong to say that

there is a 95% chance that the true parameter $\theta$ is in the resulting confidence interval,

because from the frequentist point of view, the true parameter $\theta$ is deterministic (fixed, even though unknown). Once the confidence interval is calculated, the true parameter $\theta$ either is in the C.I. or is not, like a Bernoulli random variable that takes only the values 1 or 0. But I suppose we can say that:

the expectation of that Bernoulli distribution is 95%.
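The frequentist interpretation can be illustrated by simulation: repeat the experiment many times and count how often the (plug-in) interval captures the fixed true parameter. The numbers below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, n, q, reps = 0.4, 500, 1.96, 5000  # q is the 97.5% normal quantile

covered = 0
for _ in range(reps):
    x_bar = rng.binomial(1, p_true, size=n).mean()
    se = np.sqrt(x_bar * (1 - x_bar) / n)
    covered += (x_bar - q * se <= p_true <= x_bar + q * se)

coverage = covered / reps
print(coverage)  # close to 0.95
```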

Steps to find a confidence interval

  1. Find an estimator $\hat\theta$ for $\theta$
  2. Determine the (asymptotic) distribution of $\hat\theta$
  3. Compute a confidence interval for $\theta$ based on $\hat\theta$ with level $1-\alpha$

Delta method

Exponential distribution example (1/2)

Take the exponential distribution as an example, with PDF $f(t) = \lambda e^{-\lambda t},\ \forall t \geq 0$.

Let $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} \exp(\lambda)$, and let $\overline{X}_n := \frac{1}{n}\sum_{i=1}^n X_i$ be the sample mean. By the LLN, $\overline{X}_n \xrightarrow[n\to\infty]{\text{a.s.}/\mathbf{P}} \frac{1}{\lambda}$, because $\mathbb{E}[X_1] = \frac{1}{\lambda}$.

So a natural estimator of $\lambda$ is:

$$\hat\lambda := \frac{1}{\overline{X}_n}$$

Hence $\hat\lambda \xrightarrow[n\to\infty]{\text{a.s.}/\mathbf{P}} \lambda$.

Be careful: by Jensen's inequality, $\mathbb{E}\left[\frac{1}{X_1}\right] > \frac{1}{\mathbb{E}[X_1]} = \lambda$.

By the CLT:

$$\sqrt{n}\left(\overline{X}_n - \frac{1}{\lambda}\right) \xrightarrow[n\to\infty]{(d)} \mathcal{N}(0, \lambda^{-2})$$

How does the CLT transfer to $\hat\lambda$? How do we find an asymptotic confidence interval for $\lambda$? Here we need the Delta method.

The Delta method

Let $(Z_n)_{n \geq 1}$ be a sequence of random variables that satisfies

$$\sqrt{n}(Z_n - \theta) \xrightarrow[n\to\infty]{(d)} \mathcal{N}(0, \sigma^2)$$

for some $\theta \in \mathbb{R}$ and $\sigma^2 > 0$ (the sequence $(Z_n)_{n \geq 1}$ is said to be asymptotically normal around $\theta$).

Let $g: \mathbb{R} \to \mathbb{R}$ be continuously differentiable at the point $\theta$. Then $(g(Z_n))_{n \geq 1}$ is also asymptotically normal around $g(\theta)$; more precisely:

$$\sqrt{n}\big(g(Z_n) - g(\theta)\big) \xrightarrow[n\to\infty]{(d)} \mathcal{N}\big(0, (g'(\theta))^2\sigma^2\big)$$

Exponential distribution example (2/2)

Applying the Delta method with $g(x) = \frac{1}{x}$, so that $g'(1/\lambda) = -\lambda^2$ and $(g'(1/\lambda))^2\,\lambda^{-2} = \lambda^2$:

$$\sqrt{n}(\hat\lambda - \lambda) \xrightarrow[n\to\infty]{(d)} \mathcal{N}(0, \lambda^2)$$

This gives an asymptotic confidence interval for $\lambda$:

$$\left[\hat\lambda - \frac{q_{\alpha/2}\lambda}{\sqrt{n}},\ \hat\lambda + \frac{q_{\alpha/2}\lambda}{\sqrt{n}}\right]$$

Since this interval still depends on $\lambda$, we can use the "solve" or "plug-in" method to get a usable confidence interval for $\lambda$.
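A plug-in sketch for the exponential example, with made-up values for $\lambda$, $n$, and the quantile:

```python
import numpy as np

rng = np.random.default_rng(2)
lam_true, n, q = 2.0, 1000, 1.96  # assumed values for illustration

x = rng.exponential(scale=1 / lam_true, size=n)
lam_hat = 1 / x.mean()                 # estimator: 1 / sample mean

# Plug-in: replace the unknown lambda in the interval width by lam_hat.
half = q * lam_hat / np.sqrt(n)
lo, hi = lam_hat - half, lam_hat + half
print(lo, hi)
```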

Hypothesis testing

Statistical formulation

  • Consider a sample $X_1, X_2, \ldots, X_n$ of i.i.d. random variables and a statistical model $\left(E, (\mathbf{P}_\theta)_{\theta \in \Theta}\right)$
  • Let $\Theta_0$ and $\Theta_1$ be disjoint subsets of $\Theta$
  • Consider the two hypotheses:
    • $H_0: \theta \in \Theta_0$
    • $H_1: \theta \in \Theta_1$
  • $H_0$ is the null hypothesis, $H_1$ is the alternative hypothesis
  • If we believe that the true $\theta$ is either in $\Theta_0$ or in $\Theta_1$, we may want to test $H_0$ against $H_1$
  • We want to decide whether to reject $H_0$ (look for evidence against $H_0$ in the data)

Asymmetry in the hypotheses

  • $H_0$ and $H_1$ do not play a symmetric role: the data is only used to try to disprove $H_0$
  • In particular, lack of evidence does not mean that $H_0$ is true ("innocent until proven guilty")
  • A test is a statistic $\psi \in \{0,1\}$ such that:
    • If $\psi = 0$, $H_0$ is not rejected
    • If $\psi = 1$, $H_0$ is rejected

Errors

  • Rejection region of a test $\psi$:

    $$R_\psi = \{x \in E^n : \psi(x) = 1\}$$

  • Type I error of a test $\psi$ (rejecting $H_0$ when it is actually true): $\alpha_\psi: \theta \in \Theta_0 \mapsto \mathbf{P}_\theta[\psi = 1]$

  • Type II error of a test $\psi$ (not rejecting $H_0$ although $H_1$ is actually true): $\beta_\psi: \theta \in \Theta_1 \mapsto \mathbf{P}_\theta[\psi = 0]$

  • Power of a test $\psi$:

    $$\pi_\psi = \inf_{\theta \in \Theta_1}\big(1 - \beta_\psi(\theta)\big)$$

Level, test statistic and rejection region

  • A test has level $\alpha$ if:

    $$\alpha_\psi(\theta) \leq \alpha, \quad \forall \theta \in \Theta_0$$

  • A test has asymptotic level $\alpha$ if:

    $$\lim_{n\to\infty} \alpha_{\psi_n}(\theta) \leq \alpha, \quad \forall \theta \in \Theta_0$$

  • In general, a test has the form:

    $$\psi = \mathbf{1}\{T_n > c\}$$

    for some statistic $T_n$ and threshold $c \in \mathbb{R}$

  • $T_n$ is called the test statistic. The rejection region is $R_\psi = \{T_n > c\}$

One-sided vs two-sided tests

We can refine the terminology when $\theta \in \Theta \subseteq \mathbb{R}$ and $H_0$ is of the form:

$$H_0: \theta = \theta_0 \iff \Theta_0 = \{\theta_0\}$$

  • If $H_1: \theta \neq \theta_0$: two-sided test
  • If $H_1: \theta > \theta_0$ or $H_1: \theta < \theta_0$: one-sided test

p-value

The (asymptotic) p-value of a test $\psi_\alpha$ is the smallest (asymptotic) level $\alpha$ at which $\psi_\alpha$ rejects $H_0$. It is random: it depends on the sample.

$$\text{p-value} \leq \alpha \iff H_0 \text{ is rejected by } \psi_\alpha \text{ at the (asymptotic) level } \alpha$$

The smaller the p-value, the more confidently one can reject $H_0$.
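Continuing the fair-coin sketch, the two-sided asymptotic p-value is $2(1 - \Phi(T_n))$, where $\Phi$ is the standard normal CDF:

```python
import math
from statistics import NormalDist

def p_value_fair_coin(successes, n):
    # Two-sided asymptotic p-value for H0: p = 0.5.
    T = math.sqrt(n) * abs(successes / n - 0.5) / 0.5
    return 2 * (1 - NormalDist().cdf(T))

print(p_value_fair_coin(60, 100))  # about 0.0455: reject at level 0.05
print(p_value_fair_coin(55, 100))  # about 0.3173: do not reject at level 0.05
```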

Steps of hypothesis testing

  1. Find estimators.
  2. Find a pivot and determine its distribution. Write some statistic $T_n$, and let $\psi = \mathbf{1}\{T_n > c\}$. A statistic is a pivot if we can write it in such a way that its distribution under the null hypothesis is known and does not depend on any additional parameters.
  3. Adjust $c$ to match the level $\alpha$.