When the parameter of interest is not simply the expectation E[X] of the samples, three estimation methods will be presented: maximum likelihood estimation, the method of moments, and M-estimators.
Distance measures between distributions
Two methods are presented to measure the distance between distributions: Total variation distance and Kullback-Leibler (KL) divergence.
Total variation (TV) distance
Let (E, (P_θ)_{θ∈Θ}) be a statistical model and let θ* be the true parameter. Given X₁, X₂, …, X_n, the statistician's goal is to find an estimator θ̂ = θ̂(X₁, X₂, …, X_n) such that P_θ̂ is close to P_θ*. This means: |P_θ̂(A) − P_θ*(A)| is small for all A ⊂ E, where A is an event, i.e. a subset of the sample space.
The total variation distance between two probability measures P_θ and P_θ′ with sample space E is defined by:
TV(P_θ, P_θ′) = max_{A⊂E} |P_θ(A) − P_θ′(A)|
Let P and Q be probability measures with a sample space E and probability mass functions f and g. Then, the total variation distance between P and Q:
TV(P,Q)=maxA⊂E∣P(A)−Q(A)∣
If E is discrete (total variation distance between discrete measures)
TV(P, Q) = (1/2) ∑_{x∈E} |f(x) − g(x)|
If E is continuous (total variation distance between continuous measures)
TV(P, Q) = (1/2) ∫_E |f(x) − g(x)| dx
It can be pictured as half the area between the two densities; the factor 1/2 normalizes the distance so that it always lies in [0, 1].
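As a quick numerical illustration (a Python sketch; the helper name `tv_discrete` is our own, not from any library), the discrete formula can be computed directly from two PMFs over a common finite sample space:

```python
import numpy as np

def tv_discrete(f, g):
    """Total variation distance between two PMFs given as arrays
    over a common finite sample space E: (1/2) * sum_x |f(x) - g(x)|."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return 0.5 * float(np.abs(f - g).sum())

# Bernoulli(0.5) vs Bernoulli(0.8) over E = {0, 1}:
# TV = (1/2)(|0.5 - 0.2| + |0.5 - 0.8|) = 0.3
p = [0.5, 0.5]
q = [0.2, 0.8]
print(tv_discrete(p, q))  # ≈ 0.3
```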
Properties of Total Variation Distance
d is a distance on probability measures:
Symmetric: d(P,Q)=d(Q,P)
Nonnegative: d(P,Q)≥0
Definite: d(P,Q)=0⟺P=Q
Triangle inequality: d(P,V)≤d(P,Q)+d(Q,V)
The total variation distance (TV) is a distance on probability measures.
Kullback-Leibler (KL) divergence
Let P and Q be discrete probability distributions with PMFs p and q respectively. Let’s also assume P and Q have a common sample space E. Then the KL divergence (also known as relative entropy ) between P and Q is defined by:
KL(P, Q) = ∑_{x∈E} p(x) ln(p(x)/q(x))
where the sum is only over the support of P.
If P and Q are continuous probability distributions with PDFs p and q on a common sample space E, then:
KL(P, Q) = ∫_E p(x) ln(p(x)/q(x)) dx
where the integral is again only over the support of P.
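The discrete KL formula can likewise be computed directly. This short sketch (the helper `kl_discrete` is illustrative, not a library function) also makes the asymmetry easy to check numerically:

```python
import numpy as np

def kl_discrete(p, q):
    """KL(P, Q) = sum_x p(x) ln(p(x)/q(x)), where the sum runs only
    over the support of P (terms with p(x) = 0 contribute nothing)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_discrete(p, q))  # positive
print(kl_discrete(q, p))  # a different value: KL is not symmetric
print(kl_discrete(p, p))  # 0.0: definiteness
```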
Properties of KL-divergence
Not symmetric: KL(P, Q) ≠ KL(Q, P) in general
Nonnegative: KL(P,Q)≥0
Definite: KL(P,Q)=0⟺P=Q
No triangle inequality: KL(P,V)≰KL(P,Q)+KL(Q,V) in general
The Kullback-Leibler (KL) divergence is NOT a distance.
For example, the KL divergence between two Gaussian distributions P = N(a, 1) and Q = N(b, 1) is KL(P, Q) = (a − b)²/2.
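This closed form, KL(N(a,1), N(b,1)) = (a − b)²/2, can be checked numerically by discretizing the integral definition of KL; the grid bounds below are arbitrary but wide enough to cover both densities:

```python
import numpy as np

# Check KL(N(a,1), N(b,1)) = (a - b)^2 / 2 by discretizing the
# integral of p(x) ln(p(x)/q(x)) on a fine grid.
a, b = 1.0, 3.0
x = np.linspace(-10.0, 14.0, 200_001)
dx = x[1] - x[0]
p = np.exp(-0.5 * (x - a) ** 2) / np.sqrt(2 * np.pi)
q = np.exp(-0.5 * (x - b) ** 2) / np.sqrt(2 * np.pi)
kl = float(np.sum(p * np.log(p / q)) * dx)
print(kl, (a - b) ** 2 / 2)  # both ≈ 2.0
```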
Let (E, (P_θ)_{θ∈Θ}) be a statistical model with true parameter θ*. To find an estimator θ̂, we can minimize the KL divergence:
KL(P_θ*, P_θ̂) = ∑_{x∈E} p_θ*(x) ln(p_θ*(x)/p_θ̂(x))
This approach will naturally lead to the construction of the maximum likelihood estimator.
The KL divergence KL(P,Q) can be written as an expectation with respect to the distribution P. In general, it is easier to build an estimator for the KL divergence than it is to build an estimator for the total variation distance.
Indeed, KL(P_θ*, P_θ) = E_θ*[ln p_θ*(X)] − E_θ*[ln p_θ(X)]. The left part E_θ*[ln p_θ*(X)] is a constant C that does not depend on θ. The right part can be estimated by an average, by the law of large numbers (LLN). So the KL estimator can be written as:

K̂L(P_θ*, P_θ) = C − (1/n) ∑_{i=1}^n ln p_θ(X_i)

Minimizing this estimator over θ amounts to maximizing the log-likelihood, and in practice we use the maximum likelihood estimator:

θ̂_n^MLE = argmax_{θ∈Θ} ln L(X₁, …, X_n, θ)
For example: Maximum Likelihood Estimator of a Poisson Statistical Model. Let X₁, …, X_n ~ iid Poiss(λ*) for some unknown λ* ∈ (0, ∞). The associated statistical model is (N ∪ {0}, {Poiss(λ)}_{λ∈(0,∞)}). The likelihood of a Poisson statistical model can be written:

L_n(x₁, …, x_n, λ) = ∏_{i=1}^n e^(−λ) λ^(x_i) / x_i! = e^(−nλ) λ^(∑ x_i) / (x₁! ⋯ x_n!)

And the log-likelihood is: ℓ(λ) := ln L_n(x₁, …, x_n, λ) = −nλ + (∑_{i=1}^n x_i) ln λ − ∑_{i=1}^n ln(x_i!).
The derivative of the log-likelihood can be written:
∂/∂λ ln L_n(x₁, …, x_n, λ) = −n + (1/λ) ∑_{i=1}^n x_i
and if we set the above equation to 0, we can get: λ^nMLE=Xˉn.
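As a sanity check, the following sketch simulates Poisson data and maximizes the log-likelihood over a grid (dropping the ∑ ln(x_i!) term, which does not depend on λ); the maximizer lands essentially at the sample mean, as derived above:

```python
import numpy as np

# Simulate Poisson(3.5) data and maximize the log-likelihood
# l(lambda) = -n*lambda + (sum x_i) ln(lambda) over a grid.
# (The sum of ln(x_i!) is dropped: it does not depend on lambda.)
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=10_000)

def log_lik(lam):
    return -len(x) * lam + x.sum() * np.log(lam)

grid = np.linspace(0.5, 8.0, 2_000)
lam_hat = float(grid[np.argmax(log_lik(grid))])
print(lam_hat, x.mean())  # the grid maximizer sits at the sample mean
```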
Consistency of MLE
Given i.i.d samples X1,…,Xn∼Pθ\* and an associated statistical model (E,{P_θ}θ∈Θ), the maximum likelihood estimator θ^nMLE of θ\* is a consistent estimator under mild regularity conditions (e.g. continuity in θ of the pdf p_θ almost everywhere), i.e.
θ̂_n^MLE → θ* in probability as n → ∞
Note that this is true even if the parameter θ is a vector in a higher-dimensional parameter space Θ, in which case θ̂_n^MLE is a multivariate random variable, e.g. θ = (μ, σ²)ᵀ ∈ R² for a Gaussian statistical model.
This can be proven via the KL divergence: provided the true parameter θ* is identifiable, it is the unique minimizer of θ ↦ KL(P_θ*, P_θ), and the MLE minimizes a consistent estimator of this function.
Covariance
If X and Y are random variables with respective means μ_X and μ_Y, then recall the covariance of X and Y (written Cov(X, Y)) is defined to be

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[(X − μ_X)Y] = E[X(Y − μ_Y)] = E[XY] − μ_X μ_Y

This shows that the covariance can be calculated with X and Y both centered, or with just one of them centered.
The properties of covariance:
Cov(X,X)=Var(X)
Cov(X,Y)=Cov(Y,X)
Cov(aX+bY,Z)=aCov(X,Z)+bCov(Y,Z)
If X and Y are independent, then Cov(X,Y)=0.
In general, the converse of the last property is NOT true, except when (X, Y)ᵀ is a Gaussian vector. For a counterexample where Cov(X, Y) = 0 but X and Y are not independent: let X ~ Bernoulli(1/2), and let Y = 0 whenever X = 0, while Y is uniform over {±1} whenever X = 1. Then E[Y] = (1/2)·0 + (1/4)·1 + (1/4)·(−1) = 0, and E[XY] = (0·0)·(1/2) + (1·1)·(1/4) + (1·(−1))·(1/4) = 0, so Cov(X, Y) = E[XY] − E[X]E[Y] = 0. However, X and Y are not independent, since Y is forced to be 0 when X = 0.
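The counterexample can be verified exactly by enumerating the joint PMF (the three-row table below encodes the distribution just described):

```python
# Exact computation from the joint PMF of the counterexample:
# rows are (x, y, P(X = x, Y = y)).
joint = [(0, 0, 0.5), (1, 1, 0.25), (1, -1, 0.25)]

EX = sum(p * x for x, y, p in joint)          # 0.5
EY = sum(p * y for x, y, p in joint)          # 0.0
EXY = sum(p * x * y for x, y, p in joint)     # 0.0
cov = EXY - EX * EY
print(cov)  # 0.0: X and Y are uncorrelated

# ...yet dependent: P(X=1, Y=1) != P(X=1) * P(Y=1)
P_X1 = sum(p for x, y, p in joint if x == 1)               # 0.5
P_Y1 = sum(p for x, y, p in joint if y == 1)               # 0.25
P_X1Y1 = sum(p for x, y, p in joint if x == 1 and y == 1)  # 0.25
print(P_X1Y1, P_X1 * P_Y1)  # 0.25 vs 0.125
```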
Covariance matrix
Let X = (X(1), …, X(d))ᵀ be a random vector of size d × 1. Let μ ≜ E[X] denote the entry-wise mean, i.e.

E[X] = (E[X(1)], …, E[X(d)])ᵀ

Then the covariance matrix Σ can be written as:
Σ=E[(X−μ)(X−μ)T]
This matrix has size d × d. The entry in the ith row and jth column is Σ_ij = E[(X(i) − μ(i))(X(j) − μ(j))] = Cov(X(i), X(j)).
And: Cov(AX+B)=Cov(AX)=A⋅Cov(X)⋅AT=AΣAT.
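The identity Cov(AX + B) = AΣAᵀ can be checked by Monte Carlo with NumPy; the particular A, B, Σ below are arbitrary illustrative choices:

```python
import numpy as np

# Monte-Carlo check of Cov(AX + B) = A Sigma A^T.
rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
B = np.array([5.0, -1.0])
Y = X @ A.T + B                    # each row is a sample of AX + B

emp = np.cov(Y, rowvar=False)      # empirical covariance of AX + B
theory = A @ Sigma @ A.T
print(np.round(emp, 2))            # ≈ [[8, 7.5], [7.5, 9]]
print(theory)
```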
The multivariate Gaussian distribution
A random vector X=(X(1),…,X(d))T is a Gaussian vector, or multivariate Gaussian or normal variable, if any linear combination of its components is a (univariate) Gaussian variable or a constant (a “Gaussian” variable with zero variance), i.e., if αTX is (univariate) Gaussian or constant for any constant non-zero vector α∈Rd.
The distribution of X, the d-dimensional Gaussian or normal distribution, is completely specified by the vector mean μ=E[X]=(E[X(1)],…,E[X(d)])T and the d×d covariance matrix Σ. If Σ is invertible, then the pdf of X is:
f_X(x) = 1/√((2π)^d det(Σ)) · exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)),  x ∈ R^d

where det(Σ) is the determinant of Σ, which is positive when Σ is invertible.
In 2 dimensions (d=2, (X,Y)T), its PDF depends on 5 parameters: E[X],Var(X),E[Y],Var(Y) and Cov(X,Y).
If μ=0 and Σ is the identity matrix, then X is called a standard normal random vector.
Note that when the covariance matrix Σ is diagonal, the PDF factors into PDFs of univariate Gaussians, and hence the components are independent.
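This factorization can be verified numerically; the helpers below implement the multivariate and univariate Gaussian PDF formulas from scratch (illustrative code, assuming only NumPy):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian PDF, implemented from the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm_const)

def gauss_pdf(x, mu, var):
    """Univariate Gaussian PDF with mean mu and variance var."""
    return float(np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var))

mu = np.array([1.0, -2.0])
Sigma = np.diag([4.0, 9.0])       # diagonal covariance matrix
x = np.array([0.5, 1.5])

joint = mvn_pdf(x, mu, Sigma)
product = gauss_pdf(0.5, 1.0, 4.0) * gauss_pdf(1.5, -2.0, 9.0)
print(joint, product)             # equal up to floating point
```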
The multivariate CLT
The CLT may be generalized to averages of random vectors (i.e. to vectors of averages). Let X₁, …, X_n ∈ R^d be independent copies of a random vector X such that E[X] = μ and Cov(X) = Σ. Then:
√n (X̄_n − μ) → N_d(0, Σ) in distribution as n → ∞
Equivalently:
√n Σ^(−1/2) (X̄_n − μ) → N_d(0, I_d) in distribution as n → ∞
Multivariate Delta method
Let (T_n)_{n≥1} be a sequence of random vectors in R^d such that:

√n (T_n − θ) → N_d(0, Σ) in distribution as n → ∞

for some θ ∈ R^d and some covariance matrix Σ ∈ R^{d×d}.
Let g:Rd→Rk(k≥1) be continuously differentiable at θ. Then:
√n (g(T_n) − g(θ)) → N_k(0, ∇g(θ)ᵀ Σ ∇g(θ)) in distribution as n → ∞
where:
∇g(θ) = ∂g(θ)/∂θ = (∂g_j/∂θ_i)_{1≤i≤d, 1≤j≤k} ∈ R^{d×k}
Fisher Information
Define the log-likelihood for one observation as:
ℓ(θ)=lnL1(X,θ),θ∈Θ⊂Rd
Assume that ℓ is a.s. twice differentiable. Under some regularity conditions, the Fisher information of the statistical model is defined as:

I(θ) = Cov(∇ℓ(θ)) = E[∇ℓ(θ) ∇ℓ(θ)ᵀ] − E[∇ℓ(θ)] E[∇ℓ(θ)]ᵀ = −E[Hℓ(θ)]

where Hℓ(θ) denotes the Hessian of ℓ at θ.
The Fisher information I(θ*) at the true parameter determines the asymptotic variance of the MLE: under regularity conditions, √n (θ̂_n^MLE − θ*) → N_d(0, I(θ*)⁻¹) in distribution.
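As a concrete worked example (a numerical sketch, not taken from the text above): for the Poisson model, ℓ(λ) = −λ + X ln λ − ln(X!), so the score is ∂ℓ/∂λ = −1 + X/λ, and I(λ) = Var(X)/λ² = 1/λ. The check below sums the PMF directly:

```python
import numpy as np

# Fisher information of Poiss(lambda): score = -1 + X/lambda,
# so I(lambda) = E[score^2] = Var(X)/lambda^2 = 1/lambda.
lam = 3.0
ks = np.arange(101)                   # truncation: Poisson tail mass is negligible
pmf = np.empty(101)
pmf[0] = np.exp(-lam)
for k in range(1, 101):
    pmf[k] = pmf[k - 1] * lam / k     # Poisson PMF recursion p(k) = p(k-1)*lam/k
score = -1.0 + ks / lam
I = float(np.sum(pmf * score ** 2))
print(I, 1 / lam)                     # both ≈ 1/3
```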
The method of moments
Moments
Let X1,…,Xn be i.i.d. sample associated with a statistical model (E,(Pθ)θ∈Θ), assume that E⊂R, and Θ⊂Rd, for some d≥1.
Population moments: m_k(θ) = E_θ[X₁^k], 1 ≤ k ≤ d
Empirical moments: m̂_k = (1/n) ∑_{i=1}^n X_i^k
The k-th moment is the expectation of X^k.
From LLN,
m̂_k → m_k(θ) as n → ∞, in probability and almost surely
More compactly, we say that the whole vector converges:
(m̂₁, …, m̂_d) → (m₁, …, m_d) as n → ∞, in probability and almost surely
Moments estimator
Let:
M: Θ → R^d, θ ↦ M(θ) = (m₁(θ), …, m_d(θ))
Assume M is one to one:
θ = M⁻¹(m₁(θ), …, m_d(θ))
The definition of moments estimator of θ:
θ^nMM=M−1(m^1,…,m^d)
provided it exists.
For example: let (R, {N(μ, σ²)}_{μ∈R, σ>0}) be the statistical model of a normal random variable X. Let

m_k(μ, σ) = E[X^k]

Then: m₁(μ, σ) = μ and m₂(μ, σ) = μ² + σ².
Mapping parameters to moments. Let:
ψ: R × (0, ∞) → R², (μ, σ) ↦ (m₁(μ, σ), m₂(μ, σ))

ψ is one-to-one on the given domain R × (0, ∞), and inverting ψ(μ, σ) = (m₁, m₂) gives:

μ = m₁,  σ = √(m₂ − m₁²)
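Putting this inversion into code, the following sketch draws a Gaussian sample and recovers (μ, σ) from the first two empirical moments (the true parameter values are arbitrary):

```python
import numpy as np

# Method-of-moments estimates for a Gaussian sample with true
# parameters mu = 2.0, sigma = 1.5 (arbitrary illustrative values):
# invert the moment map via mu = m1, sigma = sqrt(m2 - m1^2).
rng = np.random.default_rng(2)
sample = rng.normal(loc=2.0, scale=1.5, size=100_000)

m1_hat = sample.mean()            # (1/n) sum x_i
m2_hat = np.mean(sample ** 2)     # (1/n) sum x_i^2

mu_hat = m1_hat
sigma_hat = np.sqrt(m2_hat - m1_hat ** 2)
print(mu_hat, sigma_hat)          # close to (2.0, 1.5)
```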
Generalized method of moments
Under some technical conditions, the method of moments estimator θ^nMM is asymptotically normal. Applying the multivariate CLT and Delta method yields:
√n (θ̂_n^MM − θ*) → N(0, Γ(θ*)) in distribution as n → ∞
The quantity Γ(θ) above is referred to as the asymptotic variance:

Γ(θ) = [∂M⁻¹/∂θ (M(θ))]ᵀ Σ(θ) [∂M⁻¹/∂θ (M(θ))]

where Σ(θ) is the covariance matrix of the random vector (X₁, X₁², …, X₁^d).
MLE vs. Moment estimator
Comparison of the quadratic risks: in general, the MLE is more accurate.
The MLE still gives reasonable results if the model is misspecified.
Computational issues: sometimes the MLE is intractable, while the method of moments is easier (it reduces to solving polynomial equations).
M-estimation
M-estimation involves estimating some parameter of interest related to the underlying, unknown distribution (e.g. its mean, variance, or quantiles). Unlike maximum likelihood estimation and the method of moments, no statistical model needs to be assumed to perform M-estimation. M-estimation can be used in both a parametric and non-parametric context.
The definition of M-estimation:
Let X1,…,Xn be i.i.d. with some unknown distribution P and an associated parameter μ∗ on a sample space E. We make no modeling assumption that P is from any particular family of distributions.
An M-estimator of the parameter μ* is the argmin of an estimator of a function Q(μ) of the parameter which satisfies the following:
Q(μ) = E[ρ(X, μ)] for some function ρ: E × M → R, where M is the set of all possible values of the unknown true parameter μ*;
Q(μ) attains a unique minimum at μ = μ* in M. That is, argmin_{μ∈M} Q(μ) = μ*.
In general, the goal is to find a loss function ρ such that Q(μ) = E[ρ(X, μ)] has the properties stated above.
Note that ρ(X, μ) is in particular a function of the random variable X, and the expectation in E[ρ(X, μ)] is taken against the true distribution P of X, with associated parameter value μ*.
Because Q(μ) is an expectation, we can construct a (consistent) estimator of Q(μ) by replacing the expectation in its definition by the sample mean.
Maximum likelihood estimation is a special case of M-estimation. In the MLE case, the loss function is ρ(X_i, θ) = −ln p_θ(X_i).
Mean as a Minimizer
In the 1-d case, let E ⊆ R and M ⊆ R. If ρ(X, μ) = (X − μ)², then the minimizer is μ* = E[X], the mean of X.
In one dimension, i.e. d = 1, the matrices reduce to the scalar functions:

J(μ) = E[∂²ρ/∂μ² (X₁, μ)]
K(μ) = Var[∂ρ/∂μ (X₁, μ)]
In the log-likelihood case (write μ=θ), both of these functions are equal to the Fisher information:
J(θ)=K(θ)=I(θ)
Be careful: in the MLE case the loss function is ρ(X_i, θ) = −ln p_θ(X_i), the negative of the log-likelihood. With the MLE we maximize the objective function, whereas with M-estimation we minimize it.
Under some technical conditions, the functions J(μ) and K(μ) determine the asymptotic variance of the M-estimator μ̂_n:

√n (μ̂_n − μ*) → N(0, J(μ*)⁻¹ K(μ*) J(μ*)⁻¹) in distribution as n → ∞
Estimators that are resilient to corruptions or mistakes in the data are referred to as robust.
The empirical median is more robust than the empirical mean. However, the median is the M-estimator with the absolute-value loss ρ(X, μ) = |X − μ|, which is not differentiable at μ = X, so J and K cannot be obtained directly. To bypass this problem, we can use Huber's loss with parameter δ > 0, which is defined as:

h_δ(x) = x²/2 if |x| ≤ δ,  and h_δ(x) = δ(|x| − δ/2) if |x| > δ
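As an illustration, the sketch below (the helper names `huber` and `m_estimate` are our own) minimizes the empirical risk over a grid for Huber's loss on a sample with 5% gross corruptions; the Huber M-estimator, like the median, stays near the true location while the mean is dragged away:

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x ** 2, delta * (a - 0.5 * delta))

def m_estimate(data, loss, grid):
    """Minimize the empirical risk mu -> (1/n) sum loss(x_i - mu) over a grid."""
    risks = [np.mean(loss(data - mu)) for mu in grid]
    return float(grid[int(np.argmin(risks))])

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=1_000)
data[:50] = 100.0                          # corrupt 5% of the sample

grid = np.linspace(-5.0, 5.0, 2_001)
mu_huber = m_estimate(data, huber, grid)
print(data.mean(), np.median(data), mu_huber)
# the mean is dragged toward 100; the median and Huber estimate stay near 0
```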