Unit 3 - Methods of Estimation


When the parameter is not directly the expectation of the samples ($\mathbb{E}(X)$), three estimation methods can be used: maximum likelihood estimation, the method of moments, and M-estimators.

Distance measures between distributions

Two ways of measuring the discrepancy between distributions are presented: the total variation distance and the Kullback-Leibler (KL) divergence.

Total variation (TV) distance

Let $(E, (\mathbf{P}_\theta)_{\theta \in \Theta})$ be a statistical model, and let $\theta^*$ be the true parameter. Given $X_1, X_2, \ldots, X_n$, the statistician's goal is to find an estimator $\hat\theta = \hat\theta(X_1, X_2, \ldots, X_n)$ such that $\mathbf{P}_{\hat\theta}$ is close to $\mathbf{P}_{\theta^*}$. This means: $\vert \mathbf{P}_{\hat\theta}(A) - \mathbf{P}_{\theta^*}(A) \vert$ is small for all $A \subset E$. Here $A$ is a sub sample space (an event).

The total variation distance between two probability measures $\mathbf{P}_\theta$ and $\mathbf{P}_{\theta'}$ with sample space $E$ is defined by:

$$\text{TV}(\mathbf{P}_\theta, \mathbf{P}_{\theta'}) = \max_{A \subset E} \big\vert \mathbf{P}_\theta(A) - \mathbf{P}_{\theta'}(A) \big\vert$$

Let $\mathbf{P}$ and $\mathbf{Q}$ be probability measures with a sample space $E$ and probability mass functions $f$ and $g$. Then the total variation distance between $\mathbf{P}$ and $\mathbf{Q}$ is:

$$\text{TV}(\mathbf{P}, \mathbf{Q}) = \max_{A \subset E} \vert \mathbf{P}(A) - \mathbf{Q}(A) \vert$$

  • If $E$ is discrete (total variation distance between discrete measures):

    $$\text{TV}(\mathbf{P}, \mathbf{Q}) = \frac{1}{2} \sum_{x \in E} \vert f(x) - g(x) \vert$$

  • If $E$ is continuous (total variation distance between continuous measures):

    $$\text{TV}(\mathbf{P}, \mathbf{Q}) = \frac{1}{2} \int_{E} \vert f(x) - g(x) \vert \, \mathrm{d}x$$

    It can be pictured as half of the total area between the two PDFs; the factor $\frac{1}{2}$ normalizes the distance to lie in $[0, 1]$.
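The discrete formula is easy to check numerically. A minimal sketch in Python (the Bernoulli PMFs and the support `[0, 1]` are illustrative choices, not part of the definition):

```python
def tv_discrete(f, g, support):
    """Total variation distance: half the sum of |f(x) - g(x)| over E."""
    return 0.5 * sum(abs(f(x) - g(x)) for x in support)

p, q = 0.3, 0.5
f = lambda x: p if x == 1 else 1 - p   # PMF of Ber(p)
g = lambda x: q if x == 1 else 1 - q   # PMF of Ber(q)

# For two Bernoulli measures the distance works out to |p - q|.
print(tv_discrete(f, g, [0, 1]))  # 0.2 up to float rounding
```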

Properties of Total Variation Distance

$d$ is a distance on probability measures if it satisfies:

  • Symmetric: $d(\mathbf{P}, \mathbf{Q}) = d(\mathbf{Q}, \mathbf{P})$
  • Nonnegative: $d(\mathbf{P}, \mathbf{Q}) \geq 0$
  • Definite: $d(\mathbf{P}, \mathbf{Q}) = 0 \iff \mathbf{P} = \mathbf{Q}$
  • Triangle inequality: $d(\mathbf{P}, \mathbf{V}) \leq d(\mathbf{P}, \mathbf{Q}) + d(\mathbf{Q}, \mathbf{V})$

The total variation distance (TV) satisfies all four properties, so it is a distance on probability measures.

Kullback-Leibler (KL) divergence

Let $\mathbf{P}$ and $\mathbf{Q}$ be discrete probability distributions with PMFs $p$ and $q$ respectively. Let's also assume $\mathbf{P}$ and $\mathbf{Q}$ have a common sample space $E$. Then the KL divergence (also known as relative entropy) between $\mathbf{P}$ and $\mathbf{Q}$ is defined by:

$$\text{KL}(\mathbf{P}, \mathbf{Q}) = \sum_{x \in E} p(x) \ln\left( \frac{p(x)}{q(x)} \right)$$

where the sum is only over the support of $\mathbf{P}$.

If $\mathbf{P}$ and $\mathbf{Q}$ are continuous probability distributions with PDFs $p$ and $q$ on a common sample space $E$, then:

$$\text{KL}(\mathbf{P}, \mathbf{Q}) = \int_{E} p(x) \ln\left( \frac{p(x)}{q(x)} \right) dx$$

where the integral is again only over the support of $\mathbf{P}$.

Properties of KL-divergence

  • Not symmetric: $\text{KL}(\mathbf{P}, \mathbf{Q}) \neq \text{KL}(\mathbf{Q}, \mathbf{P})$ in general
  • Nonnegative: $\text{KL}(\mathbf{P}, \mathbf{Q}) \geq 0$
  • Definite: $\text{KL}(\mathbf{P}, \mathbf{Q}) = 0 \iff \mathbf{P} = \mathbf{Q}$
  • No triangle inequality: $\text{KL}(\mathbf{P}, \mathbf{V}) \nleq \text{KL}(\mathbf{P}, \mathbf{Q}) + \text{KL}(\mathbf{Q}, \mathbf{V})$ in general

Because it fails symmetry and the triangle inequality, the Kullback-Leibler (KL) divergence is NOT a distance.

For example, the KL divergence between two Gaussian distributions $\mathbf{P} = \mathcal{N}(a, 1)$ and $\mathbf{Q} = \mathcal{N}(b, 1)$ is:

$$\text{KL}(\mathbf{P}, \mathbf{Q}) = \frac{1}{2}(a - b)^2$$

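This identity can be sanity-checked by approximating the integral definition of KL on a grid (a sketch; the grid bounds, step, and means `a`, `b` are arbitrary choices):

```python
import numpy as np

def gauss_pdf(x, mean):
    # PDF of N(mean, 1)
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

a, b = 1.0, 3.0
x = np.linspace(-10.0, 14.0, 200_001)   # grid covering both densities
dx = x[1] - x[0]
p, q = gauss_pdf(x, a), gauss_pdf(x, b)

kl_numeric = np.sum(p * np.log(p / q)) * dx   # Riemann sum of p ln(p/q)
kl_closed = 0.5 * (a - b) ** 2

print(kl_numeric, kl_closed)   # both ≈ 2.0
```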

Maximum likelihood estimation

Let $(E, (\mathbf{P}_\theta)_{\theta \in \Theta})$ be a statistical model and let $\theta^*$ be the true parameter. To find an estimator $\hat\theta$, we can minimize the KL divergence:

$$\text{KL}(\mathbf{P}_{\theta^*}, \mathbf{P}_{\hat\theta}) = \sum_{x \in E} p_{\theta^*}(x) \ln \frac{p_{\theta^*}(x)}{p_{\hat\theta}(x)}$$

This approach will naturally lead to the construction of the maximum likelihood estimator.

The KL divergence $\text{KL}(\mathbf{P}, \mathbf{Q})$ can be written as an expectation with respect to the distribution $\mathbf{P}$. In general, it is easier to build an estimator for the KL divergence than it is to build an estimator for the total variation distance.

$$\begin{aligned} \text{KL}(\mathbf{P}_{\theta^*}, \mathbf{P}_{\theta}) &= \mathbb{E}_{\theta^*}\left[ \ln\left( \frac{p_{\theta^*}(X)}{p_{\theta}(X)} \right) \right] \\ &= \mathbb{E}_{\theta^*}[\ln p_{\theta^*}(X)] - \mathbb{E}_{\theta^*}[\ln p_{\theta}(X)] \end{aligned}$$

The first term $\mathbb{E}_{\theta^*}[\ln p_{\theta^*}(X)]$ is a constant $C$ that does not depend on $\theta$. The second term can be estimated by a sample average, justified by the Law of Large Numbers (LLN). So an estimator of the KL divergence can be written:

$$\widehat{\text{KL}}(\mathbf{P}_{\theta^*}, \mathbf{P}_{\theta}) = C - \frac{1}{n} \sum_{i=1}^n \ln(p_\theta(X_i))$$

We want to solve:

$$\begin{aligned} \min_{\theta \in \Theta} \widehat{\text{KL}}(\mathbf{P}_{\theta^*}, \mathbf{P}_{\theta}) &\Leftrightarrow \max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ln(p_\theta(X_i)) \\ &\Leftrightarrow \max_{\theta \in \Theta} \ln\left[ \prod_{i=1}^n p_\theta(X_i) \right] \\ &\Leftrightarrow \max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i) \end{aligned}$$

This is the maximum likelihood principle.

The likelihood is the function:

$$\begin{aligned} L_n : E^n \times \Theta &\to \mathbb{R} \\ (x_1, \ldots, x_n, \theta) &\mapsto \prod_{i=1}^n p_\theta(x_i) \end{aligned}$$

For example, the likelihood of a Bernoulli statistical model:

$$\begin{aligned} L_n(x_1, \ldots, x_n, p) &= \prod_{i=1}^n \left( x_i p + (1 - x_i)(1 - p) \right) \\ &= \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} \\ &= p^{\sum_{i=1}^n x_i} (1 - p)^{n - \sum_{i=1}^n x_i} \end{aligned}$$

For example, the likelihood of a Gaussian statistical model:

$$L_n(x_1, \ldots, x_n; \mu, \sigma^2) = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right)$$

Maximum Likelihood Estimator

The maximum likelihood estimator of $\theta^*$ is defined to be:

$$\begin{aligned} \hat\theta_n^{\text{MLE}} &= \operatorname*{argmax}_{\theta \in \Theta} L_n(X_1, \ldots, X_n, \theta) \\ &= \operatorname*{argmax}_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i) \end{aligned}$$

In practice, we very often maximize the log-likelihood instead, which has the same argmax:

$$\hat\theta_n^{\text{MLE}} = \operatorname*{argmax}_{\theta \in \Theta} \ln[L_n(X_1, \ldots, X_n, \theta)]$$

For example, consider the maximum likelihood estimator of a Poisson statistical model. Let $X_1, \ldots, X_n \stackrel{iid}{\sim} \text{Poiss}(\lambda^*)$ for some unknown $\lambda^* \in (0, \infty)$. The associated statistical model is $(\mathbb{N} \cup \{0\}, \{\text{Poiss}(\lambda)\}_{\lambda \in (0, \infty)})$. The likelihood of the Poisson statistical model can be written:

$$L_n(x_1, \ldots, x_n, \lambda) = \prod_{i=1}^n e^{-\lambda} \frac{\lambda^{x_i}}{x_i!} = e^{-n\lambda} \frac{\lambda^{\sum_{i=1}^n x_i}}{x_1! \cdots x_n!}$$

And the log-likelihood is $\ell(\lambda) := \ln L_n(x_1, \ldots, x_n, \lambda)$.

The derivative of the log-likelihood can be written:

$$\frac{\partial}{\partial\lambda} \ln L_n(x_1, \ldots, x_n, \lambda) = -n + \frac{\sum_{i=1}^n x_i}{\lambda}$$

Setting this derivative to $0$ and solving gives $\hat\lambda_n^{\text{MLE}} = \bar{X}_n$, the sample mean.
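A quick simulation confirms that the sample mean maximizes the Poisson log-likelihood (the true rate 4.0 and the comparison points are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=10_000)

def log_lik(lam):
    # log L_n up to the constant -sum(log(x_i!)), which does not involve lambda
    return -len(x) * lam + np.sum(x) * np.log(lam)

lam_hat = x.mean()   # the MLE derived above

# The sample mean beats every other candidate rate:
assert all(log_lik(lam_hat) >= log_lik(l) for l in [3.5, 3.9, 4.1, 4.5])
print(lam_hat)   # close to the true rate 4.0
```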

Consistency of MLE

Given i.i.d. samples $X_1, \ldots, X_n \sim \mathbf{P}_{\theta^*}$ and an associated statistical model $(E, \{\mathbf{P}_\theta\}_{\theta \in \Theta})$, the maximum likelihood estimator $\hat\theta_n^{\text{MLE}}$ of $\theta^*$ is a consistent estimator under mild regularity conditions (e.g. continuity in $\theta$ of the pdf $p_\theta$ almost everywhere), i.e.

$$\hat\theta_n^{\text{MLE}} \xrightarrow[n\to\infty]{\mathbf{P}} \theta^*$$

Note that this is true even if the parameter $\theta$ is a vector in a higher-dimensional parameter space $\Theta$, and $\hat\theta_n^{\text{MLE}}$ is a multivariate random variable, e.g. if $\theta = \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix} \in \mathbb{R}^2$ for a Gaussian statistical model.

Consistency can be proven via the KL divergence, using the fact that the true parameter $\theta^*$ is identifiable.

Covariance

If $X$ and $Y$ are random variables with respective means $\mu_X$ and $\mu_Y$, recall that the covariance of $X$ and $Y$ (written $\text{Cov}(X, Y)$) is defined to be

$$\begin{aligned} \text{Cov}(X, Y) &= \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] \\ &= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \\ &= \mathbb{E}[X(Y - \mathbb{E}[Y])] \\ &= \mathbb{E}[(X - \mathbb{E}[X])Y] \end{aligned}$$

This shows that the covariance can be computed with both $X$ and $Y$ centered, or with just one of them centered.

The properties of covariance:

  • $\textsf{Cov}(X, X) = \textsf{Var}(X)$
  • $\textsf{Cov}(X, Y) = \textsf{Cov}(Y, X)$
  • $\textsf{Cov}(aX + bY, Z) = a\,\textsf{Cov}(X, Z) + b\,\textsf{Cov}(Y, Z)$
  • If $X$ and $Y$ are independent, then $\textsf{Cov}(X, Y) = 0$.

In general, the converse of the last property is NOT true, except when $(X, Y)^T$ is a Gaussian vector. For a counterexample with $\mathbb{E}[XY] = 0$ and $\mathbb{E}[Y] = 0$ but $X, Y$ not independent: let $X \sim \textsf{Bernoulli}(\frac{1}{2})$, and let $Y$ be a random variable which is always $0$ if $X = 0$, and uniformly distributed over $\{\pm 1\}$ if $X = 1$. Notice that $\mathbb{E}[Y] = \frac{1}{2} \cdot 0 + \frac{1}{4} \cdot 1 + \frac{1}{4} \cdot (-1) = 0$. On the other hand, $\mathbb{E}[XY] = (0 \cdot 0) \cdot \frac{1}{2} + (1 \cdot 1) \cdot \frac{1}{4} + (1 \cdot (-1)) \cdot \frac{1}{4} = 0$, so $\textsf{Cov}(X, Y) = 0$. However, $X$ and $Y$ are not independent, since $X = 0$ forces $Y = 0$.
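Simulating this counterexample makes the distinction concrete (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.integers(0, 2, size=n)          # X ~ Bernoulli(1/2)
y = x * rng.choice([-1, 1], size=n)     # Y = 0 if X = 0, else uniform on {-1, +1}

print(np.cov(x, y)[0, 1])               # ≈ 0: X and Y are uncorrelated
# ...yet Y is not independent of X: X = 0 forces Y = 0.
print(np.all(y[x == 0] == 0), np.all(np.abs(y[x == 1]) == 1))   # True True
```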

Covariance matrix

Let $\mathbf{X} = \begin{pmatrix} X^{(1)} \\ \vdots \\ X^{(d)} \end{pmatrix}$ be a random vector of size $d \times 1$. Let $\mu \triangleq \mathbb{E}[\mathbf{X}]$ denote the entry-wise mean, i.e.

$$\mathbb{E}[\mathbf{X}] = \begin{pmatrix} \mathbb{E}[X^{(1)}] \\ \vdots \\ \mathbb{E}[X^{(d)}] \end{pmatrix}$$

Then the covariance matrix $\Sigma$ can be written as:

$$\Sigma = \mathbb{E}[(\mathbf{X} - \mu)(\mathbf{X} - \mu)^T]$$

This matrix has size $d \times d$. The entry in the $i$th row and $j$th column is $\Sigma_{ij} = \mathbb{E}[(X^{(i)} - \mu^{(i)})(X^{(j)} - \mu^{(j)})] = \textsf{Cov}(X^{(i)}, X^{(j)})$.

And, for a constant matrix $A$ and constant vector $B$: $\textsf{Cov}(A\mathbf{X} + B) = \textsf{Cov}(A\mathbf{X}) = A \, \textsf{Cov}(\mathbf{X}) \, A^T = A \Sigma A^T$.
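This identity can be verified empirically with NumPy (the matrices `A`, `B`, and `sigma` below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=sigma, size=500_000)

A = np.array([[1.0, 2.0], [0.0, 3.0]])
B = np.array([5.0, -1.0])
y = x @ A.T + B                 # rows are samples of AX + B

print(np.cov(y.T))              # empirical covariance of AX + B
print(A @ sigma @ A.T)          # theoretical A Sigma A^T — matches closely
```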

The multivariate Gaussian distribution

A random vector $\mathbf{X} = (X^{(1)}, \ldots, X^{(d)})^T$ is a Gaussian vector, or multivariate Gaussian or normal variable, if any linear combination of its components is a (univariate) Gaussian variable or a constant (a "Gaussian" variable with zero variance), i.e., if $\alpha^T \mathbf{X}$ is (univariate) Gaussian or constant for any constant non-zero vector $\alpha \in \mathbb{R}^d$.

The distribution of $\mathbf{X}$, the $d$-dimensional Gaussian or normal distribution, is completely specified by the mean vector $\mu = \mathbb{E}[\mathbf{X}] = (\mathbb{E}[X^{(1)}], \ldots, \mathbb{E}[X^{(d)}])^T$ and the $d \times d$ covariance matrix $\Sigma$. If $\Sigma$ is invertible, then the pdf of $\mathbf{X}$ is:

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} e^{-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)}, \quad \mathbf{x} \in \mathbb{R}^d$$

where $\det(\Sigma)$ is the determinant of $\Sigma$, which is positive when $\Sigma$ is invertible.

In 2 dimensions ($d = 2$, $(X, Y)^T$), the PDF depends on 5 parameters: $\mathbb{E}[X]$, $\textsf{Var}(X)$, $\mathbb{E}[Y]$, $\textsf{Var}(Y)$ and $\textsf{Cov}(X, Y)$.

If $\mu = 0$ and $\Sigma$ is the identity matrix, then $\mathbf{X}$ is called a standard normal random vector.

Note that when the covariance matrix $\Sigma$ is diagonal, the PDF factors into PDFs of univariate Gaussians, and hence the components are independent.

The multivariate CLT

The CLT may be generalized to averages of random vectors (i.e. vectors of averages). Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be independent copies of a random vector $X$ such that $\mathbb{E}[X] = \mu$ and $\textsf{Cov}(X) = \Sigma$. Then:

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow[n\to\infty]{(d)} \mathcal{N}_d(0, \Sigma)$$

Equivalently, if $\Sigma$ is invertible: $\sqrt{n}\,\Sigma^{-\frac{1}{2}}(\bar{X}_n - \mu) \xrightarrow[n\to\infty]{(d)} \mathcal{N}_d(0, I_d)$
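A short simulation illustrates the multivariate CLT with a non-Gaussian vector (the construction $X = (U, U + V)$ from centered exponentials is an illustrative choice; its covariance is $\begin{pmatrix}1 & 1 \\ 1 & 2\end{pmatrix}$):

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 200, 10_000

# Non-Gaussian i.i.d. vectors X = (U, U + V) with U, V ~ Exp(1) - 1,
# so that E[X] = 0 and Cov(X) = [[1, 1], [1, 2]].
u = rng.exponential(size=(reps, n)) - 1.0
v = rng.exponential(size=(reps, n)) - 1.0

# One row per replication: sqrt(n) * (Xbar_n - mu), with mu = 0 here.
z = np.sqrt(n) * np.stack([u.mean(axis=1), (u + v).mean(axis=1)], axis=1)

print(np.cov(z.T))   # ≈ [[1, 1], [1, 2]], as the multivariate CLT predicts
```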

Multivariate Delta method

Let $(T_n)_{n \geq 1}$ be a sequence of random vectors in $\mathbb{R}^d$ such that:

$$\sqrt{n}(T_n - \theta) \xrightarrow[n \to \infty]{(d)} \mathcal{N}_d(0, \Sigma)$$

for some $\theta \in \mathbb{R}^d$ and some covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$.

Let $g: \mathbb{R}^d \to \mathbb{R}^k$ ($k \geq 1$) be continuously differentiable at $\theta$. Then:

$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow[n \to \infty]{(d)} \mathcal{N}_k(0, \nabla g(\theta)^T \Sigma \nabla g(\theta))$$

where:

$$\nabla g(\theta) = \frac{\partial g}{\partial \theta}(\theta) = \left( \frac{\partial g_j}{\partial \theta_i} \right)_{\substack{1 \leq i \leq d \\ 1 \leq j \leq k}} \in \mathbb{R}^{d \times k}$$

Fisher Information

Define the log-likelihood for one observation as:

$$\ell(\theta) = \ln L_1(X, \theta), \quad \theta \in \Theta \subset \mathbb{R}^d$$

Assume that $\ell$ is a.s. twice differentiable. Under some regularity conditions, the Fisher information of the statistical model is defined as:

$$\mathcal{I}(\theta) = \textsf{Cov}(\nabla\ell(\theta)) = -\mathbb{E}[\mathbf{H}\ell(\theta)]$$

where $\nabla\ell(\theta)$ is the gradient and $\mathbf{H}\ell(\theta)$ the Hessian of the log-likelihood.

If $\Theta \subset \mathbb{R}$, we get:

$$\mathcal{I}(\theta) = \textsf{Var}[\ell'(\theta)] = -\mathbb{E}[\ell''(\theta)]$$

For example, let $X \sim \textsf{Ber}(p)$:

$$\begin{aligned} \ell(p) &= X \ln(p) + (1 - X)\ln(1 - p) \\ \ell'(p) &= \frac{X}{p} - \frac{1 - X}{1 - p}, \qquad \textsf{Var}[\ell'(p)] = \frac{1}{p(1 - p)} \\ \ell''(p) &= -\frac{X}{p^2} - \frac{1 - X}{(1 - p)^2}, \qquad -\mathbb{E}[\ell''(p)] = \frac{1}{p(1 - p)} \end{aligned}$$

So the Fisher information of the Bernoulli distribution is $\mathcal{I}(p) = \frac{1}{p(1 - p)}$.
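Both characterizations of the Fisher information can be checked by simulation for the Bernoulli model ($p = 0.3$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)

score = x / p - (1 - x) / (1 - p)             # l'(p), evaluated at the true p
curv = x / p ** 2 + (1 - x) / (1 - p) ** 2    # -l''(p)

print(score.var(), curv.mean())   # both ≈ 1 / (p (1 - p))
print(1 / (p * (1 - p)))          # ≈ 4.7619
```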

The Fisher information of common distributions can easily be found on Wikipedia, e.g. wiki/Bernoulli_distribution.

Asymptotic normality of the MLE

Let $\theta^* \in \Theta$ be the true parameter. Assume the following:

  1. The parameter is identifiable.
  2. For all $\theta \in \Theta$, the support of $\mathbf{P}_\theta$ does not depend on $\theta$;
  3. $\theta^*$ is not on the boundary of $\Theta$;
  4. $\mathcal{I}(\theta)$ is invertible in a neighborhood of $\theta^*$;
  5. A few more technical conditions.

Then, $\hat\theta_n^{\textsf{MLE}}$ satisfies:

  • $\hat\theta_n^{\textsf{MLE}} \xrightarrow[n\to\infty]{\mathbf{P}} \theta^*$, w.r.t. $\mathbf{P}_{\theta^*}$;

  • $\sqrt{n}(\hat\theta_n^{\textsf{MLE}} - \theta^*) \xrightarrow[n\to\infty]{(d)} \mathcal{N}_d(0, \mathcal{I}(\theta^*)^{-1})$, w.r.t. $\mathbf{P}_{\theta^*}$.

The Fisher information $\mathcal{I}(\theta^*)$ at the true parameter determines the asymptotic variance of $\hat\theta_n^{\textsf{MLE}}$.

The method of moments

Moments

Let $X_1, \ldots, X_n$ be an i.i.d. sample associated with a statistical model $(E, (\mathbf{P}_\theta)_{\theta \in \Theta})$. Assume that $E \subset \mathbb{R}$ and $\Theta \subset \mathbb{R}^d$ for some $d \geq 1$.

Population moments: let $m_k(\theta) = \mathbb{E}_\theta[X_1^k]$, $1 \leq k \leq d$.

Empirical moments: let $\hat{m}_k = \overline{X_n^k} = \frac{1}{n} \sum_{i=1}^n X_i^k$ (these depend on the sample, not on $\theta$).

The $k$-th moment is the mean (expectation) of $X^k$.

From the LLN,

$$\hat{m}_k \xrightarrow[n\to\infty]{\mathbf{P}/a.s.} m_k(\theta)$$

More compactly, we say that the whole vector converges:

$$(\hat{m}_1, \ldots, \hat{m}_d) \xrightarrow[n\to\infty]{\mathbf{P}/a.s.} (m_1, \ldots, m_d)$$

Moments estimator

Let:

$$\begin{aligned} M: \Theta &\to \mathbb{R}^d \\ \theta &\mapsto M(\theta) = (m_1(\theta), \ldots, m_d(\theta)) \end{aligned}$$

Assume $M$ is one-to-one, so that:

$$\theta = M^{-1}(m_1(\theta), \ldots, m_d(\theta))$$

The moments estimator of $\theta$ is then defined as:

$$\hat\theta_n^{\textsf{MM}} = M^{-1}(\hat{m}_1, \ldots, \hat{m}_d)$$

provided it exists.

For example, let $(\mathbb{R}, \{\mathcal{N}(\mu, \sigma^2)\}_{\mu \in \mathbb{R}, \sigma > 0})$ be the statistical model of a normal random variable $X$. Let

$$m_k(\mu, \sigma) = \mathbb{E}[X^k]$$

Then: $m_1(\mu, \sigma) = \mu$, $\quad m_2(\mu, \sigma) = \mu^2 + \sigma^2$.

Mapping parameters to moments, let:

$$\begin{aligned} \psi: \mathbb{R} \times (0, \infty) &\to \mathbb{R}^2 \\ (\mu, \sigma) &\mapsto (m_1(\mu, \sigma), m_2(\mu, \sigma)) \end{aligned}$$

$\psi$ is one-to-one on the given domain $\mathbb{R} \times (0, \infty)$; writing $\psi(\mu, \sigma) = (m_1, m_2)$ and inverting:

$$\begin{aligned} \mu &= m_1 \\ \sigma &= \sqrt{m_2 - m_1^2} \end{aligned}$$
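The resulting method-of-moments recipe for the Gaussian model can be sketched in a few lines (the true parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)

m1_hat = x.mean()             # estimates m1 = mu
m2_hat = (x ** 2).mean()      # estimates m2 = mu^2 + sigma^2

# Invert the moment map psi:
mu_hat = m1_hat
sigma_hat = np.sqrt(m2_hat - m1_hat ** 2)

print(mu_hat, sigma_hat)      # close to the true (2.0, 3.0)
```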

Generalized method of moments

Under some technical conditions, the method of moments estimator $\hat\theta_n^{\textsf{MM}}$ is asymptotically normal. Applying the multivariate CLT and the Delta method yields:

$$\sqrt{n}(\hat\theta_n^{\textsf{MM}} - \theta^*) \xrightarrow[n \to \infty]{(d)} \mathcal{N}(0, \Gamma(\theta^*))$$

The quantity $\Gamma(\theta)$ above is referred to as the asymptotic variance:

$$\Gamma(\theta) = \left[ \frac{\partial M^{-1}}{\partial \theta}(M(\theta)) \right]^T \Sigma(\theta) \left[ \frac{\partial M^{-1}}{\partial \theta}(M(\theta)) \right]$$

where $\Sigma(\theta)$ is the covariance matrix of the moment vector $(X_1, X_1^2, \ldots, X_1^d)$.

MLE vs. Moment estimator

  • Comparison of the quadratic risks: in general, the MLE is more accurate.
  • The MLE still gives good results if the model is misspecified.
  • Computational issues: sometimes the MLE is intractable, while the MM estimator is easier to compute (it reduces to solving polynomial equations).

M-estimation

M-estimation involves estimating some parameter of interest related to the underlying, unknown distribution (e.g. its mean, variance, or quantiles). Unlike maximum likelihood estimation and the method of moments, no statistical model needs to be assumed to perform M-estimation. M-estimation can be used in both a parametric and non-parametric context.

The definition of M-estimation:

Let $X_1, \ldots, X_n$ be i.i.d. with some unknown distribution $\mathbf{P}$ and an associated parameter $\mu^*$ on a sample space $E$. We make no modeling assumption that $\mathbf{P}$ is from any particular family of distributions.

An M-estimator of the parameter $\mu^*$ is the argmin of an estimator of a function $\mathcal{Q}(\mu)$ of the parameter which satisfies the following:

  • $\mathcal{Q}(\mu) = \mathbb{E}[\rho(X, \mu)]$ for some function $\rho: E \times \mathcal{M} \to \mathbb{R}$, where $\mathcal{M}$ is the set of all possible values of the unknown true parameter $\mu^*$;
  • $\mathcal{Q}(\mu)$ attains a unique minimum at $\mu = \mu^*$ in $\mathcal{M}$. That is, $\operatorname{argmin}_{\mu \in \mathcal{M}} \mathcal{Q}(\mu) = \mu^*$.

In general, the goal is to find a loss function $\rho$ such that $\mathcal{Q}(\mu) = \mathbb{E}[\rho(X, \mu)]$ has the properties stated above.

Note that the function $\rho(X, \mu)$ is in particular a function of the random variable $X$, and the expectation in $\mathbb{E}[\rho(X, \mu)]$ is taken against the true distribution $\mathbf{P}$ of $X$, with associated parameter value $\mu^*$.

Because $\mathcal{Q}(\mu)$ is an expectation, we can construct a (consistent) estimator of $\mathcal{Q}(\mu)$ by replacing the expectation in its definition with the sample mean.

Maximum likelihood estimation is a special case of M-estimation: in the MLE case, the loss function is $\rho(X_i, \theta) = -\ln p_\theta(X_i)$.

Mean as a Minimizer

In the 1-d case, let $E \subset \mathbb{R}$ and $\mathcal{M} \subset \mathbb{R}$. If $\rho(X, \mu) = (X - \mu)^2$, then $\mu^*$ is the mean of $X$, i.e. $\mathbb{E}[X]$.

Proof (assuming $X$ is continuous with density $f$):

$$\begin{aligned} \mathcal{Q}(\mu) = \mathbb{E}[\rho(X, \mu)] &= \mathbb{E}[(X - \mu)^2] \\ &= \int_{-\infty}^\infty (x - \mu)^2 f(x) \, dx \\ &= \int_{-\infty}^\infty (x^2 - 2\mu x + \mu^2) f(x) \, dx \\ \mathcal{Q}'(\mu) = \frac{d}{d\mu} \mathcal{Q}(\mu) &= \frac{d}{d\mu} \left( \int_{-\infty}^\infty (x^2 - 2\mu x + \mu^2) f(x) \, dx \right) \\ &= -2 \int_{-\infty}^\infty x f(x) \, dx + 2\mu \int_{-\infty}^\infty f(x) \, dx \\ &= -2\mathbb{E}[X] + 2\mu \end{aligned}$$

Setting $\mathcal{Q}'(\mu) = 0$ gives $\mu^* = \mathbb{E}[X]$.

It also works when $E \subset \mathbb{R}^d$ and $\mathcal{M} \subset \mathbb{R}^d$: let $\rho(X, \mu) = \Vert X - \mu \Vert_2^2$; then $\mu^* = \mathbb{E}[X] \in \mathbb{R}^d$.

Median as a Minimizer

In the 1-d case, let $E \subset \mathbb{R}$ and $\mathcal{M} \subset \mathbb{R}$. If $\rho(X, \mu) = \vert X - \mu \vert$, then $\mu^*$ is the median of $X$.

Proof (assuming $X$ is continuous with density $f$):

$$\begin{aligned} \mathcal{Q}(\mu) = \mathbb{E}[\vert X - \mu \vert] &= \int_{-\infty}^\infty \vert x - \mu \vert f(x) \, dx \\ &= \int_\mu^\infty (x - \mu) f(x) \, dx + \int_{-\infty}^\mu (\mu - x) f(x) \, dx \\ \mathcal{Q}'(\mu) = \frac{d}{d\mu} \mathcal{Q}(\mu) &= -\int_\mu^\infty f(x) \, dx + \int_{-\infty}^\mu f(x) \, dx \end{aligned}$$

(When differentiating the limits of integration, the boundary terms vanish because both integrands are $0$ at $x = \mu$.)

Setting $\mathcal{Q}'(\mu) = 0$ gives $\int_{-\infty}^\mu f(x) \, dx = \int_\mu^\infty f(x) \, dx$, so by the definition of the median, $\mu^*$ is the median of $X$.
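A numeric check that the median minimizes the empirical counterpart of $\mathcal{Q}(\mu)$ (the exponential sample is an arbitrary skewed choice; for an odd sample size, the sample median is an exact minimizer of the empirical absolute loss):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=100_001)   # odd size: exact sample median

def q_hat(mu):
    # Empirical Q(mu) = mean of |x_i - mu|
    return np.mean(np.abs(x - mu))

med = np.median(x)   # ≈ ln 2 ≈ 0.693 for Exp(1)

# The sample median beats every other candidate location:
assert all(q_hat(med) <= q_hat(m) for m in [0.3, 0.6, 0.8, 1.0])
print(med)
```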

Quantile as a Minimizer

The check function is defined as:

$$C_\alpha(x) = \begin{cases} -(1 - \alpha)x & \text{if } x < 0 \\ \alpha x & \text{if } x \geq 0 \end{cases}$$

Assume that $X$ is a continuous random variable with density $f: \mathbb{R} \to \mathbb{R}$. Define the $\alpha$-quantile of $X$ to be the value $Q_X(\alpha) \in \mathbb{R}$ such that

$$\mathbf{P}(X \leq Q_X(\alpha)) = \alpha$$

(This differs from the quantile notation used for a standard normal distribution, where $q_\alpha$ is such that $\mathbf{P}(X > q_\alpha) = \alpha$.)

We now prove that the $\alpha$-quantile of $X$ satisfies $Q_X(\alpha) = \operatorname{argmin}_{\mu \in \mathbb{R}} \mathbb{E}[C_\alpha(X - \mu)]$.

Proof:

$$\begin{aligned} \mathcal{Q}(\mu) = \mathbb{E}[C_\alpha(X - \mu)] &= \int_{-\infty}^\infty C_\alpha(x - \mu) f(x) \, dx \\ &= \int_\mu^\infty \alpha(x - \mu) f(x) \, dx - \int_{-\infty}^\mu (1 - \alpha)(x - \mu) f(x) \, dx \\ &= \alpha \int_\mu^\infty x f(x) \, dx - \alpha\mu \int_\mu^\infty f(x) \, dx \\ &\quad - (1 - \alpha) \int_{-\infty}^\mu x f(x) \, dx + (1 - \alpha)\mu \int_{-\infty}^\mu f(x) \, dx \\ \mathcal{Q}'(\mu) &= -\alpha\mu f(\mu) - \alpha \int_\mu^\infty f(x) \, dx + \alpha\mu f(\mu) \\ &\quad - (1 - \alpha)\mu f(\mu) + (1 - \alpha) \int_{-\infty}^\mu f(x) \, dx + (1 - \alpha)\mu f(\mu) \\ &= \int_{-\infty}^\mu f(x) \, dx - \alpha \left( \int_{-\infty}^\mu f(x) \, dx + \int_\mu^\infty f(x) \, dx \right) \\ &= \int_{-\infty}^\mu f(x) \, dx - \alpha \end{aligned}$$

Setting $\mathcal{Q}'(\mu) = 0$ gives $\alpha = \int_{-\infty}^{\mu^*} f(x) \, dx$, so by definition $\mu^*$ is the $\alpha$-quantile of $X$.
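A numeric check that minimizing the check loss recovers the $\alpha$-quantile ($\alpha = 0.9$, the uniform sample, and the search grid are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(size=100_000)
alpha = 0.9

def check_loss(mu):
    # Empirical mean of C_alpha(x_i - mu)
    r = x - mu
    return np.mean(np.where(r >= 0, alpha * r, -(1 - alpha) * r))

grid = np.linspace(0.0, 1.0, 1001)
mu_star = grid[np.argmin([check_loss(m) for m in grid])]

print(mu_star)   # ≈ 0.9, the 0.9-quantile of Uniform(0, 1)
```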

Asymptotic Normality of M-estimators

The J\mathbf J and K\mathbf K matrices:

$$\begin{aligned} \mathbf{J} &= \mathbb{E}[\mathbf{H}\rho] = \mathbb{E}\left[ \begin{pmatrix} \frac{\partial^2 \rho}{\partial \mu_1 \partial \mu_1}(\mathbf{X}_1, \vec{\mu}) & \ldots & \frac{\partial^2 \rho}{\partial \mu_1 \partial \mu_d}(\mathbf{X}_1, \vec{\mu}) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 \rho}{\partial \mu_d \partial \mu_1}(\mathbf{X}_1, \vec{\mu}) & \ldots & \frac{\partial^2 \rho}{\partial \mu_d \partial \mu_d}(\mathbf{X}_1, \vec{\mu}) \end{pmatrix} \right] \quad (d \times d) \\ \\ \mathbf{K} &= \textsf{Cov}[\nabla\rho(\mathbf{X}_1, \vec{\mu})] = \textsf{Cov}\left[ \begin{pmatrix} \frac{\partial \rho}{\partial \mu_1}(\mathbf{X}_1, \vec{\mu}) \\ \vdots \\ \frac{\partial \rho}{\partial \mu_d}(\mathbf{X}_1, \vec{\mu}) \end{pmatrix} \right] \quad (d \times d) \end{aligned}$$

In one dimension, i.e. d=1d=1, the matrices reduce to the following:

$$\begin{aligned} \mathbf{J} &= \mathbb{E}\left[ \frac{\partial^2 \rho}{\partial \mu^2}(X_1, \mu) \right] \\ \mathbf{K} &= \textsf{Var}\left[ \frac{\partial \rho}{\partial \mu}(X_1, \mu) \right] \end{aligned}$$

In the log-likelihood case (writing $\mu = \theta$), both of these functions are equal to the Fisher information:

$$\mathbf{J}(\theta) = \mathbf{K}(\theta) = \mathcal{I}(\theta)$$

Be careful that in the MLE case the loss function is $\rho(X_i, \theta) = -\ln p_\theta(X_i)$, the negative of the log-likelihood: with MLE we maximize the objective function, whereas with M-estimation we minimize it.

Under some technical conditions, the functions $\mathbf{J}(\mu)$ and $\mathbf{K}(\mu)$ determine the asymptotic variance of the M-estimator $\hat\mu_n$:

$$\begin{aligned} \hat\mu_n &\xrightarrow[n \to \infty]{\mathbf{P}} \mu^* \\ \sqrt{n}(\hat\mu_n - \mu^*) &\xrightarrow[n \to \infty]{(d)} \mathcal{N}(0, \mathbf{J}(\mu^*)^{-1} \mathbf{K}(\mu^*) \mathbf{J}(\mu^*)^{-1}) \end{aligned}$$

M-estimators in robust statistics

Estimators that are more resilient to corruptions or mistakes in the data than others are referred to as robust.

The empirical median is more robust than the empirical mean. However, the median estimator uses the absolute value as its loss function, $\rho(X, \mu) = \vert X - \mu \vert$, which is not differentiable at $X = \mu$, so $\mathbf{J}$ and $\mathbf{K}$ cannot be obtained directly. To bypass this problem, we can use Huber's loss, which is defined as:

$$h_\delta(x) = \begin{cases} \frac{x^2}{2} & \text{if } \vert x \vert \leq \delta \\ \delta(\vert x \vert - \delta/2) & \text{if } \vert x \vert > \delta \end{cases}$$

This Huber’s loss is differentiable everywhere.