The trinity of statistical inference: estimation, confidence intervals and testing.
 Estimator: one value whose performance can be measured by consistency, asymptotic normality, bias, variance and quadratic risk.
 Confidence intervals provide “error bars” around estimators. Their size depends on the confidence level.
 Hypothesis testing: we want a yes/no answer to a question about an unknown parameter. Tests are characterized by hypotheses, level, power, test statistic and rejection region. Under the null hypothesis, the value of the unknown parameter becomes known (no need for plug-in).
Statistical model
Formal definition:
Let the observed outcome of a statistical experiment be a sample $X_1,X_2,\ldots ,X_ n$ of $n$ i.i.d. random variables in some measurable space $E$ (usually $E \subseteq \mathbb R$) and denote by $\mathbf P$ their common distribution. A statistical model associated to that statistical experiment is a pair:
\[\left(E, ( \mathbf P_\theta ) _ {\theta \in \Theta }\right)\]where:
 $E$ is called the sample space
 $( \mathbf P_\theta ) _ {\theta \in \Theta }$ is a family of probability measures on $E$
 $\Theta$ is any set, called the parameter set.
For example: the statistical model of Bernoulli distribution: $(\{0,1\},(\textsf{Ber}(p)) _ {p \in (0,1)} )$
Parametric, nonparametric and semiparametric models
Usually, we will assume that the statistical model is well specified, i.e., defined such that $\mathbf P=\mathbf P_\theta$ for some $\theta \in \Theta$. This particular $\theta$ is called the true parameter, and is unknown: the aim of the statistical experiment is to estimate $\theta$, or to check its properties when they have a special meaning.
 if $\Theta \subseteq \mathbb R^d$ for some $d\geq 1$, the model is called parametric
 if $\Theta$ is infinite dimensional, the model is called nonparametric
 if $\Theta=\Theta _1\times \Theta_2$, where $\Theta _1$ is finite dimensional, and $\Theta _2$ is infinite dimensional, then the model is called semiparametric.
Identifiability
The parameter $\theta$ is called identifiable iff the map $\theta \in \Theta \mapsto \mathbf P_\theta$ is injective, i.e.:
\[\theta \neq \theta ' \Rightarrow \mathbf P_\theta \neq \mathbf P_ {\theta '}\]or equivalently:
\[\mathbf P_\theta = \mathbf P_ {\theta '} \Rightarrow \theta = \theta '\]
Estimation
A statistic is any measurable function of the sample.
An estimator of $\theta$ is a statistic $\hat\theta _ n = \hat\theta _ n(X_1,\ldots , X_ n)$ whose expression does not depend on $\theta$.
 An estimator $\hat\theta _ n$ of $\theta$ is weakly (resp. strongly) consistent if $\hat\theta _ n \xrightarrow [n\rightarrow \infty ]{\mathbf P/(resp.\ a.s.)} \theta$
 An estimator $\hat\theta _ n$ of $\theta$ is asymptotically normal if $\sqrt n(\hat\theta _ n-\theta) \xrightarrow [n\rightarrow \infty ]{(d)} \mathcal{N}(0,\sigma^2)$. The quantity $\sigma ^2$ is then called the asymptotic variance of $\hat\theta _ n$.
Bias of an estimator $\hat\theta _ n$ of $\theta$:
\[\text {bias}(\hat\theta _ n)=\mathbb E[\hat\theta _ n]-\theta\]If $\text {bias}(\hat\theta _ n)=0$, we say that $\hat\theta _ n$ is unbiased.
We want estimators to have low bias and low variance at the same time.
The Risk (or quadratic risk) of an estimator $\hat\theta _ n\in \mathbb R$ is
\[\mathbf R(\hat\theta _ n)=\mathbb E[\vert\hat\theta _ n-\theta\vert^2]\]which decomposes as: $\text {Quadratic Risk}=\text {Variance}+\text {Bias}^2$
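As a short check of this decomposition (a standard derivation, written out here for completeness), add and subtract $\mathbb E[\hat\theta _ n]$ inside the square:
\[\mathbb E[(\hat\theta _ n-\theta)^2]=\mathbb E[(\hat\theta _ n-\mathbb E[\hat\theta _ n])^2]+2(\mathbb E[\hat\theta _ n]-\theta)\,\mathbb E[\hat\theta _ n-\mathbb E[\hat\theta _ n]]+(\mathbb E[\hat\theta _ n]-\theta)^2=\text {Var}(\hat\theta _ n)+\text {bias}(\hat\theta _ n)^2\]The cross term vanishes because $\mathbb E[\hat\theta _ n-\mathbb E[\hat\theta _ n]]=0$.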
For example: for Bernoulli distribution $(\{0,1\},(\textsf{Ber}(p)) _ {p \in (0,1)} )$, using $\hat {p} _ n = \overline{X} _ n$ as an estimator for $p$, this estimator is unbiased, consistent, and its quadratic risk tends to 0 as the sample size $n \to \infty$.
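The claims in this example can be checked numerically. The sketch below is only an illustration: the true value $p=0.3$, the number of repetitions and the function names are assumptions made here, not part of the notes. It estimates the bias and the quadratic risk of $\overline{X} _ n$ by Monte Carlo and compares the risk with the exact value $p(1-p)/n$.

```python
import random


def sample_mean_bernoulli(n, p, rng):
    """Draw n Bernoulli(p) observations and return their sample mean."""
    return sum(1 if rng.random() < p else 0 for _ in range(n)) / n


def empirical_bias_and_risk(n, p, n_rep=2000, seed=0):
    """Monte Carlo estimates of the bias and quadratic risk of the sample mean."""
    rng = random.Random(seed)
    estimates = [sample_mean_bernoulli(n, p, rng) for _ in range(n_rep)]
    bias = sum(estimates) / n_rep - p
    risk = sum((e - p) ** 2 for e in estimates) / n_rep
    return bias, risk


def theoretical_risk(n, p):
    """Exact quadratic risk of the sample mean: variance p(1-p)/n, zero bias."""
    return p * (1 - p) / n


if __name__ == "__main__":
    p = 0.3  # assumed true parameter, chosen only for illustration
    for n in (10, 100, 1000):
        bias, risk = empirical_bias_and_risk(n, p)
        print(n, round(bias, 4), round(risk, 6), theoretical_risk(n, p))
```

The empirical bias should stay close to 0 for every $n$, while the empirical risk should shrink roughly like $1/n$.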
Confidence Intervals
Let $\left(E, ( \mathbf P_\theta ) _ {\theta \in \Theta }\right)$ be a statistical model based on observations $X_1,X_2,\ldots ,X_ n$ and assume $\Theta \subseteq \mathbb R$. Let $\alpha \in (0,1)$.

Confidence interval (C.I.) of level $1-\alpha$ for $\theta$: any random (depending on $X_1,X_2,\ldots ,X_ n$) interval $\mathcal I$ whose boundaries do not depend on $\theta$ and such that:
\[\mathbf P_\theta [\mathcal I\ni \theta]\geq 1-\alpha, \quad \forall \theta \in \Theta\]
C.I. of asymptotic level $1-\alpha$ for $\theta$: any random interval $\mathcal I$ whose boundaries do not depend on $\theta$ and such that:
\[\lim _ {n\to \infty } \mathbf{P}_\theta [\mathcal I\ni \theta]\geq 1-\alpha, \quad \forall \theta \in \Theta\]
Be aware that the requirement is $\mathbf P_\theta [\mathcal I\ni \theta] \geq 1-\alpha$, not $= 1-\alpha$.
For example: for Bernoulli distribution $(\{0,1\},(\textsf{Ber}(p)) _ {p \in (0,1)} )$, using $\hat{p } _ n = \overline{X} _ n$ as an estimator for $p$, and from CLT:
\[\sqrt{n}\frac{\overline{X} _ n-p}{\sqrt{p(1-p)}}\xrightarrow [n\rightarrow \infty ]{(d)} \mathcal{N}(0,1)\]For a fixed $\alpha \in (0,1)$, if $q_ {\alpha/2}$ is the $(1-\alpha/2)$-quantile of $\mathcal{N}(0,1)$, then with probability $\simeq 1-\alpha$ (if $n$ is large enough),
\[\overline{X} _ n \in \left[p-\frac {q_ {\alpha/2}\sqrt{p(1-p)}}{\sqrt n},\ p+\frac {q_ {\alpha/2}\sqrt{p(1-p)}}{\sqrt n}\right]\]It yields:
\[\lim _ {n\to \infty } \mathbf{P}\left( \left[\overline{X} _ n-\frac {q_ {\alpha/2}\sqrt{p(1-p)}}{\sqrt n},\ \overline{X} _ n+\frac {q_ {\alpha/2}\sqrt{p(1-p)}}{\sqrt n}\right] \ni p\right)=1-\alpha .\]But this is not a confidence interval, because its boundaries depend on the unknown $p$. Three solutions are presented below.
Conservative bound
Since $p(1-p)\leq 1/4$, for $n$ large enough the following interval contains $p$ with probability roughly at least $1-\alpha$:
\[\mathcal I_ {\textsf {conserv}}=\left[\overline{X} _ n -\frac {q_ {\alpha/2}}{2\sqrt n},\ \overline{X} _ n +\frac {q_ {\alpha/2}}{2\sqrt n}\right]\]Indeed: $\lim _ {n\to \infty } \mathbf{P}( \mathcal I_ {\textsf {conserv}} \ni p)\geq 1-\alpha .$
Solving the (quadratic) equation for p
From
\[\overline{X} _ n-\frac {q_ {\alpha/2}\sqrt{p(1-p)}}{\sqrt n} \leq p \leq \overline{X} _ n+\frac {q_ {\alpha/2}\sqrt{p(1-p)}}{\sqrt n}\]we can get
\[(p-\overline{X} _ n)^2\leq \frac {q_ {\alpha/2}^2p(1-p)}{n}\]We need to find the roots $p_1<p_2$ of
\[\left(1+\frac {q_ {\alpha/2}^2}{n}\right)p^2-\left(2\overline{X} _ n+\frac {q_ {\alpha/2}^2}{n} \right)p+\overline{X} _ n^2=0\]This leads to $\mathcal I_ {\textsf {solve}}=[p_1,p_2]$, such that: $\lim _ {n\to \infty } \mathbf{P}( \mathcal I_ {\textsf {solve}} \ni p)= 1-\alpha$.
Plug-in
This method uses the estimate $\hat p$ in place of $p$ to compute the variance.
By LLN: $\hat p=\overline{X} _ n\xrightarrow [n\rightarrow \infty ]{\mathbf P,\ a.s.} p$, and by Slutsky:
\[\sqrt{n}\frac{\overline{X} _ n-p}{\sqrt{\hat p(1-\hat p)}}\xrightarrow [n\rightarrow \infty ]{(d)} \mathcal{N}(0,1)\]This leads to:
\[\mathcal I_ {\textsf {plug-in}} = \left[\overline{X} _ n-\frac {q_ {\alpha/2}\sqrt{\hat p(1-\hat p)}}{\sqrt n},\ \overline{X} _ n+\frac {q_ {\alpha/2}\sqrt{\hat p(1-\hat p)}}{\sqrt n}\right]\]such that: $\lim _ {n\to \infty } \mathbf{P}( \mathcal I_ {\textsf {plug-in}} \ni p)= 1-\alpha .$
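The three solutions can be compared side by side. The following is a minimal sketch, assuming only the Python standard library; the function name `bernoulli_intervals` and the dictionary keys are made up for this illustration. It computes the conservative, solve and plug-in intervals from the sample mean $\overline{X} _ n$ and the sample size $n$.

```python
from math import sqrt
from statistics import NormalDist


def bernoulli_intervals(x_bar, n, alpha=0.05):
    """Return the three asymptotic intervals for p, given the sample mean x_bar."""
    # q is the (1 - alpha/2)-quantile of the standard normal distribution.
    q = NormalDist().inv_cdf(1 - alpha / 2)

    # Conservative bound: replace sqrt(p(1-p)) by its maximum value 1/2.
    half = q / (2 * sqrt(n))
    conserv = (x_bar - half, x_bar + half)

    # Solve: roots p1 < p2 of (1 + q^2/n) p^2 - (2 x_bar + q^2/n) p + x_bar^2 = 0.
    a = 1 + q ** 2 / n
    b = -(2 * x_bar + q ** 2 / n)
    c = x_bar ** 2
    root = sqrt(b * b - 4 * a * c)
    solve = ((-b - root) / (2 * a), (-b + root) / (2 * a))

    # Plug-in: replace p by its estimate x_bar inside the variance.
    half = q * sqrt(x_bar * (1 - x_bar) / n)
    plugin = (x_bar - half, x_bar + half)

    return {"conserv": conserv, "solve": solve, "plugin": plugin}


if __name__ == "__main__":
    for name, (lo, hi) in bernoulli_intervals(x_bar=0.4, n=100).items():
        print(f"{name:8s} [{lo:.4f}, {hi:.4f}]")
```

The conservative interval is never narrower than the plug-in one, since $\sqrt{\hat p(1-\hat p)}\leq 1/2$. Unlike the other two, the solve interval always stays inside $[0,1]$.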
Meaning of confidence interval
There is a frequentist interpretation:
95% C.I. means if we were to repeat the experiment then the true parameter $\theta$ would be in the resulting confidence interval about 95% of the time.
It is wrong to say that:
There is a 95% chance that the true parameter $\theta$ is in the computed confidence interval.
From the frequentist point of view, the true parameter $\theta$ is deterministic (fixed, even though unknown). Once the confidence interval has been computed from the data, $\theta$ is either in it or not: the indicator of that event takes only the value 1 or 0, like a realization of a Bernoulli variable. But I suppose we can say that:
Before the data are observed, that Bernoulli variable has expectation (at least, or asymptotically) 95%.
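The frequentist reading can be illustrated by simulation. This is only a sketch under assumed settings ($p=0.3$, $n=200$, a plug-in interval, 2000 repetitions; all names are hypothetical): the experiment is repeated many times, and we count how often the computed interval contains the fixed true $p$.

```python
import random
from math import sqrt
from statistics import NormalDist


def plugin_interval(sample, alpha=0.05):
    """Asymptotic plug-in confidence interval for a Bernoulli parameter."""
    n = len(sample)
    x_bar = sum(sample) / n
    q = NormalDist().inv_cdf(1 - alpha / 2)
    half = q * sqrt(x_bar * (1 - x_bar) / n)
    return x_bar - half, x_bar + half


def coverage(p, n, n_rep=2000, alpha=0.05, seed=1):
    """Fraction of repeated experiments whose interval contains the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        sample = [1 if rng.random() < p else 0 for _ in range(n)]
        lo, hi = plugin_interval(sample, alpha)
        hits += lo <= p <= hi  # each repetition contributes exactly 0 or 1
    return hits / n_rep


if __name__ == "__main__":
    print(coverage(p=0.3, n=200))
```

Each single repetition yields 0 or 1. Only the average over repetitions is close to 95%.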
Steps to find a confidence interval
 Find an estimator $\hat\theta$ of $\theta$
 Determine the (asymptotic) distribution of $\hat\theta$
 Compute a confidence interval for $\theta$ based on $\hat\theta$ with (asymptotic) level $1-\alpha$
Delta method
Exponential distribution example (1/2)
Take the exponential distribution as an example, with PDF: $f(t)=\lambda e^{-\lambda t}, \ \forall t \geq 0$.
Let $X_1, X_2, \ldots , X_ n \stackrel{iid}{\sim } \exp (\lambda )$, and its sample mean: $\overline{X} _ n := \frac{1}{n} \sum _ {i = 1}^ n X_ i$. By LLN: $\overline{X} _ n \xrightarrow [n\rightarrow \infty ]{a.s. / \mathbf P} \frac 1\lambda$, because $\mathbb E[X_1]=\frac 1\lambda$.
So a natural estimator of $\lambda$ is:
\[\hat\lambda:=\frac 1{\overline{X} _ n}\]Hence: $\hat\lambda \xrightarrow [n\rightarrow \infty ]{a.s. / \mathbf P} \lambda$.
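Be careful: by Jensen's inequality, $\mathbb E[\frac 1{\overline{X} _ n}]>\frac 1{\mathbb E[\overline{X} _ n]}=\lambda$, so $\hat\lambda$ is a biased estimator of $\lambda$, even though it is consistent.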
By CLT:
\[\sqrt{n}\left(\overline{X} _ n - \frac 1\lambda \right) \xrightarrow [n \to \infty ]{(d)}\mathcal{N}(0,\lambda^{-2})\]How does the CLT transfer to $\hat\lambda$? How to find an asymptotic confidence interval for $\lambda$? Here we need to use the Delta method.
The Delta method
Let $(Z_n) _ {n \geq 1}$ be a sequence of random variables that satisfies
\[\sqrt{n}(Z_ n - \theta ) \xrightarrow [n \to \infty ]{(d)}\mathcal{N}(0,\sigma^2)\]for some $\theta \in \mathbb R$ and $\sigma^2>0$ (the sequence $(Z_n) _ {n \geq 1}$ is said to be asymptotically normal around $\theta$).
Let $g:\mathbb R \to \mathbb R$ be continuously differentiable at the point $\theta$. Then, $(g(Z_n)) _ {n \geq 1}$ is also asymptotically normal around $g(\theta)$; more precisely:
\[\sqrt{n}(g(Z_ n) - g(\theta) ) \xrightarrow [n \to \infty ]{(d)}\mathcal{N}(0,(g'(\theta))^2\sigma^2)\]
Exponential distribution example (2/2)
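Apply the delta method with $g(x)=\frac 1x$, $\theta=\frac 1\lambda$ and $\sigma^2=\lambda^{-2}$. Since $g'(x)=-\frac 1{x^2}$, we get $(g'(\theta))^2\sigma^2=\lambda^4\cdot \lambda^{-2}=\lambda^2$, hence:
\[\sqrt{n}(\hat\lambda - \lambda ) \xrightarrow [n \to \infty ]{(d)}\mathcal{N}(0,\lambda^2)\]This gives, with probability $\simeq 1-\alpha$ for $n$ large enough:
\[\lambda \in \left[\hat\lambda - \frac {q_ {\alpha/2}\lambda}{\sqrt n},\ \hat\lambda +\frac {q_ {\alpha/2}\lambda}{\sqrt n}\right]\]The boundaries still depend on $\lambda$, so we can use the “Solve” or “Plug-in” method to get an asymptotic confidence interval for $\lambda$.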
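Both routes can be written in closed form for this example. The sketch below is an illustration under assumptions (function name, simulated data with $\lambda=2$ and $n=500$ are all made up here). The plug-in interval replaces $\lambda$ by $\hat\lambda$ in the boundaries. The solve interval comes from rewriting $\vert\hat\lambda-\lambda\vert\leq r\lambda$ with $r=q_ {\alpha/2}/\sqrt n$ as $\hat\lambda/(1+r)\leq\lambda\leq\hat\lambda/(1-r)$, which requires $r<1$.

```python
import random
from math import sqrt
from statistics import NormalDist


def exponential_rate_intervals(sample, alpha=0.05):
    """Return (lam_hat, plug-in interval, solve interval) for the rate lambda."""
    n = len(sample)
    lam_hat = n / sum(sample)  # lam_hat = 1 / sample mean
    q = NormalDist().inv_cdf(1 - alpha / 2)
    r = q / sqrt(n)
    if r >= 1:
        raise ValueError("sample too small: the solve interval needs q / sqrt(n) < 1")

    # Plug-in: replace lambda by lam_hat in the boundaries.
    plugin = (lam_hat * (1 - r), lam_hat * (1 + r))

    # Solve: |lam_hat - lam| <= r * lam  <=>  lam_hat/(1+r) <= lam <= lam_hat/(1-r).
    solve = (lam_hat / (1 + r), lam_hat / (1 - r))
    return lam_hat, plugin, solve


if __name__ == "__main__":
    rng = random.Random(0)
    true_lam = 2.0  # assumed true rate, for illustration only
    data = [rng.expovariate(true_lam) for _ in range(500)]
    print(exponential_rate_intervals(data))
```

The plug-in interval is symmetric around $\hat\lambda$, while the solve interval is shifted towards larger values of $\lambda$.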
Hypothesis testing
Statistical formulation
 Consider a sample $X_1,X_2,\ldots ,X_ n$ of i.i.d. random variables and a statistical model $\left(E, ( \mathbf P_\theta ) _ {\theta \in \Theta }\right)$
 Let $\Theta _0$ and $\Theta _1$ be disjoint subsets of $\Theta$
 Consider the two hypotheses:
 $H_0: \theta \in \Theta _0$
 $H_1: \theta \in \Theta _1$
 $H_0$ is the null hypothesis, $H_1$ is the alternative hypothesis
 If we believe that the true $\theta$ is either in $\Theta _0$ or in $\Theta _1$, we may want to test $H_0$ against $H_1$
 We want to decide whether to reject $H_0$ (look for evidence against $H_0$ in the data)
Asymmetry in the hypotheses
 $H_0$ and $H_1$ do not play a symmetric role: the data is only used to try to disprove $H_0$
 In particular, lack of evidence does not mean that $H_0$ is true (“innocent until proven guilty”)
 A test is a statistic $\psi \in \{0,1\}$ such that:
 If $\psi =0$, $H_0$ is not rejected
 If $\psi =1$, $H_0$ is rejected
Errors

Rejection region of a test $\psi$:
\[R_\psi=\{x \in E^n:\psi (x)=1\}\]
 Type I error of a test $\psi$ (rejecting $H_0$ when it is actually true): $\alpha_\psi: \Theta_0 \to [0,1],\ \theta \mapsto \mathbf P_\theta [\psi=1]$
 Type II error of a test $\psi$ (not rejecting $H_0$ although $H_1$ is actually true): $\beta_\psi: \Theta_1 \to [0,1],\ \theta \mapsto \mathbf P_\theta [\psi=0]$

Power of a test $\psi$:
\[\pi_\psi=\inf\limits_ {\theta \in \Theta_1}(1-\beta_\psi(\theta))\]
Level, test statistic and rejection region

A test has level $\alpha$ if:
\[\alpha_\psi(\theta) \leq \alpha, \quad \forall \theta \in \Theta_0\] 
A test has asymptotic level $\alpha$ if:
\[\lim_ {n \to \infty}\alpha_ {\psi _ n}(\theta) \leq \alpha, \quad \forall \theta \in \Theta_0\] 
In general, a test has the form:
\[\psi=\mathbf 1\{T_n>c\}\]for some statistic $T_n$ and threshold $c \in \mathbb R$

$T_n$ is called the test statistic. The rejection region is $R_\psi=\{T_n>c\}$
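To make the level and the type II error concrete, here is a small simulation sketch. Everything specific in it is an assumption chosen for illustration: a Bernoulli model, the one-sided hypotheses $H_0: p=p_0$ against $H_1: p>p_0$, the statistic $T_n=\sqrt n(\overline{X} _ n-p_0)/\sqrt{p_0(1-p_0)}$, and the threshold $c=q_\alpha$ (the $(1-\alpha)$-quantile of $\mathcal{N}(0,1)$). The rejection frequency estimates the type I error when the data come from $p_0$, and $1-\beta_\psi(p)$ when they come from some $p>p_0$.

```python
import random
from math import sqrt
from statistics import NormalDist


def one_sided_psi(sample, p0, alpha=0.05):
    """Return psi = 1{T_n > c} for H0: p = p0 against H1: p > p0."""
    n = len(sample)
    x_bar = sum(sample) / n
    # Under H0 the variance p0(1 - p0) is known, so no plug-in is needed.
    t_n = sqrt(n) * (x_bar - p0) / sqrt(p0 * (1 - p0))
    c = NormalDist().inv_cdf(1 - alpha)
    return int(t_n > c)


def rejection_rate(p_true, p0, n, n_rep=2000, alpha=0.05, seed=2):
    """Monte Carlo estimate of P_{p_true}[psi = 1]."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        sample = [1 if rng.random() < p_true else 0 for _ in range(n)]
        rejections += one_sided_psi(sample, p0, alpha)
    return rejections / n_rep


if __name__ == "__main__":
    print("type I error at p0 = 0.5:", rejection_rate(0.5, 0.5, n=200))
    print("1 - beta at p = 0.6:", rejection_rate(0.6, 0.5, n=200))
```

The first number should be close to $\alpha$. The second grows towards 1 as $n$ increases or as $p$ moves away from $p_0$.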
One-sided vs two-sided tests
We can refine the terminology when $\theta \in \Theta \subset \mathbb R$ and $H_0$ is of the form:
\[H_0: \theta=\theta_0 \iff \Theta_0=\{ \theta_0\}\]
 If $H_1:\theta \neq \theta_0$: two-sided test
 If $H_1:\theta > \theta_0$ or $H_1:\theta < \theta_0$: one-sided test
p-value
The (asymptotic) p-value of a test $\psi_\alpha$ is the smallest (asymptotic) level $\alpha$ at which $\psi_\alpha$ rejects $H_0$. It is random: it depends on the sample.
$\text {p-value} \leq \alpha \iff H_0$ is rejected by $\psi_\alpha$, at the (asymptotic) level $\alpha$
The smaller the p-value, the more confidently one can reject $H_0$.
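As an illustration, consider a two-sided test in the Bernoulli model (an assumed example, not taken from the notes): $H_0: p=p_0$ against $H_1: p\neq p_0$, with $T_n=\sqrt n\,\vert\overline{X} _ n-p_0\vert/\sqrt{p_0(1-p_0)}$ and $\psi_\alpha=\mathbf 1\{T_n>q_ {\alpha/2}\}$. The smallest level at which $\psi_\alpha$ rejects is then $2(1-\Phi(T_n))$, where $\Phi$ is the standard normal CDF. The function names below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def abs_z_statistic(x_bar, n, p0):
    """T_n = sqrt(n) |x_bar - p0| / sqrt(p0 (1 - p0))."""
    return sqrt(n) * abs(x_bar - p0) / sqrt(p0 * (1 - p0))


def rejects(x_bar, n, p0, alpha):
    """psi_alpha = 1{T_n > q_{alpha/2}}, a two-sided test of asymptotic level alpha."""
    return abs_z_statistic(x_bar, n, p0) > STD_NORMAL.inv_cdf(1 - alpha / 2)


def p_value(x_bar, n, p0):
    """Asymptotic p-value: probability that |N(0,1)| exceeds the observed T_n."""
    return 2 * (1 - STD_NORMAL.cdf(abs_z_statistic(x_bar, n, p0)))


if __name__ == "__main__":
    pv = p_value(x_bar=0.6, n=100, p0=0.5)
    print(pv, rejects(0.6, 100, 0.5, alpha=0.05), rejects(0.6, 100, 0.5, alpha=0.01))
```

With these inputs $T_n\approx 2$, so the p-value falls between 0.01 and 0.05: $H_0$ is rejected at level 5% but not at level 1%.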
Steps of hypothesis testing
 Find estimators
 Find a pivot and determine its distribution. Write some statistic $T_n$, and let $\psi=\mathbf 1\{T_n>c\}$
A statistic is a pivot if we can write it in such a way that its distribution under the null hypothesis is known and does not depend on any additional parameters.
 Adjust $c$ to match level $\alpha$.