Unit 3 - Methods of Estimation


When the parameter is not directly the expectation of the samples ($\mathbb{E}(X)$), three estimation methods can be used: maximum likelihood estimation, the method of moments, and M-estimators.

Distance measures between distributions

Two ways of measuring the discrepancy between distributions are presented: the total variation distance and the Kullback-Leibler (KL) divergence.

Total variation (TV) distance

Let $(E, (\mathbf{P}_\theta)_{\theta \in \Theta})$ be a statistical model, and let $\theta^*$ be the true parameter. Given $X_1, X_2, \ldots, X_n$, the statistician's goal is to find an estimator $\hat\theta = \hat\theta(X_1, X_2, \ldots, X_n)$ such that $\mathbf{P}_{\hat\theta}$ is close to $\mathbf{P}_{\theta^*}$. This means: $\vert \mathbf{P}_{\hat\theta}(A) - \mathbf{P}_{\theta^*}(A) \vert$ is small for all $A \subset E$. Here $A$ is a sub sample space (an event).

The total variation distance between two probability measures $\mathbf{P}_\theta$ and $\mathbf{P}_{\theta'}$ with sample space $E$ is defined by:

$$\text{TV}(\mathbf{P}_\theta, \mathbf{P}_{\theta'}) = \max_{A \subset E} \big\vert \mathbf{P}_\theta(A) - \mathbf{P}_{\theta'}(A) \big\vert$$

Let $\mathbf{P}$ and $\mathbf{Q}$ be probability measures with a sample space $E$ and probability mass functions $f$ and $g$. Then the total variation distance between $\mathbf{P}$ and $\mathbf{Q}$ is:

$$\text{TV}(\mathbf{P}, \mathbf{Q}) = \max_{A \subset E} \vert \mathbf{P}(A) - \mathbf{Q}(A) \vert$$

  • If $E$ is discrete (total variation distance between discrete measures):

    $$\text{TV}(\mathbf{P}, \mathbf{Q}) = \frac{1}{2} \sum_{x \in E} \vert f(x) - g(x) \vert$$

  • If $E$ is continuous (total variation distance between continuous measures):

    $$\text{TV}(\mathbf{P}, \mathbf{Q}) = \frac{1}{2} \int_{E} \vert f(x) - g(x) \vert \, \mathrm{d}x$$

    It can be pictured as half of the total area between the two PDFs; the factor $\frac{1}{2}$ normalizes the distance to lie in $[0, 1]$.
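The discrete formula is easy to check numerically. A minimal sketch in Python (the Bernoulli PMFs and the support `[0, 1]` are illustrative choices, not part of the definition):

```python
def tv_discrete(f, g, support):
    """Total variation distance: half the sum of |f(x) - g(x)| over E."""
    return 0.5 * sum(abs(f(x) - g(x)) for x in support)

p, q = 0.3, 0.5
f = lambda x: p if x == 1 else 1 - p   # PMF of Ber(p)
g = lambda x: q if x == 1 else 1 - q   # PMF of Ber(q)

# For two Bernoulli measures the distance works out to |p - q|.
print(tv_discrete(f, g, [0, 1]))  # 0.2 up to float rounding
```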

Properties of Total Variation Distance

$d$ is a distance on probability measures if it satisfies:

  • Symmetric: $d(\mathbf{P}, \mathbf{Q}) = d(\mathbf{Q}, \mathbf{P})$
  • Nonnegative: $d(\mathbf{P}, \mathbf{Q}) \geq 0$
  • Definite: $d(\mathbf{P}, \mathbf{Q}) = 0 \iff \mathbf{P} = \mathbf{Q}$
  • Triangle inequality: $d(\mathbf{P}, \mathbf{V}) \leq d(\mathbf{P}, \mathbf{Q}) + d(\mathbf{Q}, \mathbf{V})$

The total variation distance (TV) satisfies all four properties, so it is a distance on probability measures.

Kullback-Leibler (KL) divergence

Let $\mathbf{P}$ and $\mathbf{Q}$ be discrete probability distributions with PMFs $p$ and $q$ respectively. Let's also assume $\mathbf{P}$ and $\mathbf{Q}$ have a common sample space $E$. Then the KL divergence (also known as relative entropy) between $\mathbf{P}$ and $\mathbf{Q}$ is defined by:

$$\text{KL}(\mathbf{P}, \mathbf{Q}) = \sum_{x \in E} p(x) \ln\left( \frac{p(x)}{q(x)} \right)$$

where the sum is only over the support of $\mathbf{P}$.

If $\mathbf{P}$ and $\mathbf{Q}$ are continuous probability distributions with PDFs $p$ and $q$ on a common sample space $E$, then:

$$\text{KL}(\mathbf{P}, \mathbf{Q}) = \int_{E} p(x) \ln\left( \frac{p(x)}{q(x)} \right) dx$$

where the integral is again only over the support of $\mathbf{P}$.

Properties of KL-divergence

  • Not symmetric: $\text{KL}(\mathbf{P}, \mathbf{Q}) \neq \text{KL}(\mathbf{Q}, \mathbf{P})$ in general
  • Nonnegative: $\text{KL}(\mathbf{P}, \mathbf{Q}) \geq 0$
  • Definite: $\text{KL}(\mathbf{P}, \mathbf{Q}) = 0 \iff \mathbf{P} = \mathbf{Q}$
  • No triangle inequality: $\text{KL}(\mathbf{P}, \mathbf{V}) \nleq \text{KL}(\mathbf{P}, \mathbf{Q}) + \text{KL}(\mathbf{Q}, \mathbf{V})$ in general

Because it fails symmetry and the triangle inequality, the Kullback-Leibler (KL) divergence is NOT a distance.

For example, the KL divergence between two Gaussian distributions $\mathbf{P} = \mathcal{N}(a, 1)$ and $\mathbf{Q} = \mathcal{N}(b, 1)$ is:

$$\text{KL}(\mathbf{P}, \mathbf{Q}) = \frac{1}{2}(a - b)^2$$

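This identity can be sanity-checked by approximating the integral definition of KL on a grid (a sketch; the grid bounds, step, and means `a`, `b` are arbitrary choices):

```python
import numpy as np

def gauss_pdf(x, mean):
    # PDF of N(mean, 1)
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

a, b = 1.0, 3.0
x = np.linspace(-10.0, 14.0, 200_001)   # grid covering both densities
dx = x[1] - x[0]
p, q = gauss_pdf(x, a), gauss_pdf(x, b)

kl_numeric = np.sum(p * np.log(p / q)) * dx   # Riemann sum of p ln(p/q)
kl_closed = 0.5 * (a - b) ** 2

print(kl_numeric, kl_closed)   # both ≈ 2.0
```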

Maximum likelihood estimation

Let $(E, (\mathbf{P}_\theta)_{\theta \in \Theta})$ be a statistical model and let $\theta^*$ be the true parameter. To find an estimator $\hat\theta$, we can minimize the KL divergence:

$$\text{KL}(\mathbf{P}_{\theta^*}, \mathbf{P}_{\hat\theta}) = \sum_{x \in E} p_{\theta^*}(x) \ln \frac{p_{\theta^*}(x)}{p_{\hat\theta}(x)}$$

This approach will naturally lead to the construction of the maximum likelihood estimator.

The KL divergence $\text{KL}(\mathbf{P}, \mathbf{Q})$ can be written as an expectation with respect to the distribution $\mathbf{P}$. In general, it is easier to build an estimator for the KL divergence than it is to build an estimator for the total variation distance.

$$\begin{aligned} \text{KL}(\mathbf{P}_{\theta^*}, \mathbf{P}_{\theta}) &= \mathbb{E}_{\theta^*}\left[ \ln\left( \frac{p_{\theta^*}(X)}{p_{\theta}(X)} \right) \right] \\ &= \mathbb{E}_{\theta^*}[\ln p_{\theta^*}(X)] - \mathbb{E}_{\theta^*}[\ln p_{\theta}(X)] \end{aligned}$$

The first term $\mathbb{E}_{\theta^*}[\ln p_{\theta^*}(X)]$ is a constant $C$ that does not depend on $\theta$. The second term can be estimated by a sample average, justified by the Law of Large Numbers (LLN). So an estimator of the KL divergence can be written:

$$\widehat{\text{KL}}(\mathbf{P}_{\theta^*}, \mathbf{P}_{\theta}) = C - \frac{1}{n} \sum_{i=1}^n \ln(p_\theta(X_i))$$

We want to solve:

$$\begin{aligned} \min_{\theta \in \Theta} \widehat{\text{KL}}(\mathbf{P}_{\theta^*}, \mathbf{P}_{\theta}) &\Leftrightarrow \max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ln(p_\theta(X_i)) \\ &\Leftrightarrow \max_{\theta \in \Theta} \ln\left[ \prod_{i=1}^n p_\theta(X_i) \right] \\ &\Leftrightarrow \max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i) \end{aligned}$$

This is the maximum likelihood principle.

The likelihood is the function:

$$\begin{aligned} L_n : E^n \times \Theta &\to \mathbb{R} \\ (x_1, \ldots, x_n, \theta) &\mapsto \prod_{i=1}^n p_\theta(x_i) \end{aligned}$$

For example, the likelihood of a Bernoulli statistical model:

$$\begin{aligned} L_n(x_1, \ldots, x_n, p) &= \prod_{i=1}^n \left( x_i p + (1 - x_i)(1 - p) \right) \\ &= \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} \\ &= p^{\sum_{i=1}^n x_i} (1 - p)^{n - \sum_{i=1}^n x_i} \end{aligned}$$

For example, the likelihood of a Gaussian statistical model:

$$L_n(x_1, \ldots, x_n; \mu, \sigma^2) = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right)$$

Maximum Likelihood Estimator

The maximum likelihood estimator of $\theta^*$ is defined to be:

$$\begin{aligned} \hat\theta_n^{\text{MLE}} &= \operatorname*{argmax}_{\theta \in \Theta} L_n(X_1, \ldots, X_n, \theta) \\ &= \operatorname*{argmax}_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i) \end{aligned}$$

In practice, we very often maximize the log-likelihood instead, which has the same argmax:

$$\hat\theta_n^{\text{MLE}} = \operatorname*{argmax}_{\theta \in \Theta} \ln[L_n(X_1, \ldots, X_n, \theta)]$$

For example, consider the maximum likelihood estimator of a Poisson statistical model. Let $X_1, \ldots, X_n \stackrel{iid}{\sim} \text{Poiss}(\lambda^*)$ for some unknown $\lambda^* \in (0, \infty)$. The associated statistical model is $(\mathbb{N} \cup \{0\}, \{\text{Poiss}(\lambda)\}_{\lambda \in (0, \infty)})$. The likelihood of the Poisson statistical model can be written:

$$L_n(x_1, \ldots, x_n, \lambda) = \prod_{i=1}^n e^{-\lambda} \frac{\lambda^{x_i}}{x_i!} = e^{-n\lambda} \frac{\lambda^{\sum_{i=1}^n x_i}}{x_1! \cdots x_n!}$$

And the log-likelihood is $\ell(\lambda) := \ln L_n(x_1, \ldots, x_n, \lambda)$.

The derivative of the log-likelihood can be written:

$$\frac{\partial}{\partial\lambda} \ln L_n(x_1, \ldots, x_n, \lambda) = -n + \frac{\sum_{i=1}^n x_i}{\lambda}$$

Setting this derivative to $0$ and solving gives $\hat\lambda_n^{\text{MLE}} = \bar{X}_n$, the sample mean.
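A quick simulation confirms that the sample mean maximizes the Poisson log-likelihood (the true rate 4.0 and the comparison points are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=10_000)

def log_lik(lam):
    # log L_n up to the constant -sum(log(x_i!)), which does not involve lambda
    return -len(x) * lam + np.sum(x) * np.log(lam)

lam_hat = x.mean()   # the MLE derived above

# The sample mean beats every other candidate rate:
assert all(log_lik(lam_hat) >= log_lik(l) for l in [3.5, 3.9, 4.1, 4.5])
print(lam_hat)   # close to the true rate 4.0
```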

Consistency of MLE

Given i.i.d. samples $X_1, \ldots, X_n \sim \mathbf{P}_{\theta^*}$ and an associated statistical model $(E, \{\mathbf{P}_\theta\}_{\theta \in \Theta})$, the maximum likelihood estimator $\hat\theta_n^{\text{MLE}}$ of $\theta^*$ is a consistent estimator under mild regularity conditions (e.g. continuity in $\theta$ of the pdf $p_\theta$ almost everywhere), i.e.

$$\hat\theta_n^{\text{MLE}} \xrightarrow[n\to\infty]{\mathbf{P}} \theta^*$$

Note that this is true even if the parameter $\theta$ is a vector in a higher-dimensional parameter space $\Theta$, and $\hat\theta_n^{\text{MLE}}$ is a multivariate random variable, e.g. if $\theta = \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix} \in \mathbb{R}^2$ for a Gaussian statistical model.

Consistency can be proven via the KL divergence, using the fact that the true parameter $\theta^*$ is identifiable.

Covariance

If $X$ and $Y$ are random variables with respective means $\mu_X$ and $\mu_Y$, recall that the covariance of $X$ and $Y$ (written $\text{Cov}(X, Y)$) is defined to be

$$\begin{aligned} \text{Cov}(X, Y) &= \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] \\ &= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \\ &= \mathbb{E}[X(Y - \mathbb{E}[Y])] \\ &= \mathbb{E}[(X - \mathbb{E}[X])Y] \end{aligned}$$

This shows that the covariance can be computed with both $X$ and $Y$ centered, or with just one of them centered.

The properties of covariance:

  • $\textsf{Cov}(X, X) = \textsf{Var}(X)$
  • $\textsf{Cov}(X, Y) = \textsf{Cov}(Y, X)$
  • $\textsf{Cov}(aX + bY, Z) = a\,\textsf{Cov}(X, Z) + b\,\textsf{Cov}(Y, Z)$
  • If $X$ and $Y$ are independent, then $\textsf{Cov}(X, Y) = 0$.

In general, the converse of the last property is NOT true, except when $(X, Y)^T$ is a Gaussian vector. For a counterexample with $\mathbb{E}[XY] = 0$ and $\mathbb{E}[Y] = 0$ but $X, Y$ not independent: let $X \sim \textsf{Bernoulli}(\frac{1}{2})$, and let $Y$ be a random variable which is always $0$ if $X = 0$, and uniformly distributed over $\{\pm 1\}$ if $X = 1$. Notice that $\mathbb{E}[Y] = \frac{1}{2} \cdot 0 + \frac{1}{4} \cdot 1 + \frac{1}{4} \cdot (-1) = 0$. On the other hand, $\mathbb{E}[XY] = (0 \cdot 0) \cdot \frac{1}{2} + (1 \cdot 1) \cdot \frac{1}{4} + (1 \cdot (-1)) \cdot \frac{1}{4} = 0$, so $\textsf{Cov}(X, Y) = 0$. However, $X$ and $Y$ are not independent, since $X = 0$ forces $Y = 0$.
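Simulating this counterexample makes the distinction concrete (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.integers(0, 2, size=n)          # X ~ Bernoulli(1/2)
y = x * rng.choice([-1, 1], size=n)     # Y = 0 if X = 0, else uniform on {-1, +1}

print(np.cov(x, y)[0, 1])               # ≈ 0: X and Y are uncorrelated
# ...yet Y is not independent of X: X = 0 forces Y = 0.
print(np.all(y[x == 0] == 0), np.all(np.abs(y[x == 1]) == 1))   # True True
```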

Covariance matrix

Let $\mathbf{X} = \begin{pmatrix} X^{(1)} \\ \vdots \\ X^{(d)} \end{pmatrix}$ be a random vector of size $d \times 1$. Let $\mu \triangleq \mathbb{E}[\mathbf{X}]$ denote the entry-wise mean, i.e.

$$\mathbb{E}[\mathbf{X}] = \begin{pmatrix} \mathbb{E}[X^{(1)}] \\ \vdots \\ \mathbb{E}[X^{(d)}] \end{pmatrix}$$

Then the covariance matrix $\Sigma$ can be written as:

$$\Sigma = \mathbb{E}[(\mathbf{X} - \mu)(\mathbf{X} - \mu)^T]$$

This matrix has size $d \times d$. The entry in the $i$th row and $j$th column is $\Sigma_{ij} = \mathbb{E}[(X^{(i)} - \mu^{(i)})(X^{(j)} - \mu^{(j)})] = \textsf{Cov}(X^{(i)}, X^{(j)})$.

And, for a constant matrix $A$ and constant vector $B$: $\textsf{Cov}(A\mathbf{X} + B) = \textsf{Cov}(A\mathbf{X}) = A \, \textsf{Cov}(\mathbf{X}) \, A^T = A \Sigma A^T$.
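This identity can be verified empirically with NumPy (the matrices `A`, `B`, and `sigma` below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=sigma, size=500_000)

A = np.array([[1.0, 2.0], [0.0, 3.0]])
B = np.array([5.0, -1.0])
y = x @ A.T + B                 # rows are samples of AX + B

print(np.cov(y.T))              # empirical covariance of AX + B
print(A @ sigma @ A.T)          # theoretical A Sigma A^T — matches closely
```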

The multivariate Gaussian distribution

A random vector $\mathbf{X} = (X^{(1)}, \ldots, X^{(d)})^T$ is a Gaussian vector, or multivariate Gaussian or normal variable, if any linear combination of its components is a (univariate) Gaussian variable or a constant (a "Gaussian" variable with zero variance), i.e., if $\alpha^T \mathbf{X}$ is (univariate) Gaussian or constant for any constant non-zero vector $\alpha \in \mathbb{R}^d$.

The distribution of $\mathbf{X}$, the $d$-dimensional Gaussian or normal distribution, is completely specified by the mean vector $\mu = \mathbb{E}[\mathbf{X}] = (\mathbb{E}[X^{(1)}], \ldots, \mathbb{E}[X^{(d)}])^T$ and the $d \times d$ covariance matrix $\Sigma$. If $\Sigma$ is invertible, then the pdf of $\mathbf{X}$ is:

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} e^{-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)}, \quad \mathbf{x} \in \mathbb{R}^d$$

where $\det(\Sigma)$ is the determinant of $\Sigma$, which is positive when $\Sigma$ is invertible.

In 2 dimensions ($d = 2$, $(X, Y)^T$), the PDF depends on 5 parameters: $\mathbb{E}[X]$, $\textsf{Var}(X)$, $\mathbb{E}[Y]$, $\textsf{Var}(Y)$ and $\textsf{Cov}(X, Y)$.

If $\mu = 0$ and $\Sigma$ is the identity matrix, then $\mathbf{X}$ is called a standard normal random vector.

Note that when the covariance matrix $\Sigma$ is diagonal, the PDF factors into PDFs of univariate Gaussians, and hence the components are independent.

The multivariate CLT

The CLT may be generalized to averages of random vectors (i.e. vectors of averages). Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be independent copies of a random vector $X$ such that $\mathbb{E}[X] = \mu$ and $\textsf{Cov}(X) = \Sigma$. Then:

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow[n\to\infty]{(d)} \mathcal{N}_d(0, \Sigma)$$

Equivalently, if $\Sigma$ is invertible: $\sqrt{n}\,\Sigma^{-\frac{1}{2}}(\bar{X}_n - \mu) \xrightarrow[n\to\infty]{(d)} \mathcal{N}_d(0, I_d)$
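A short simulation illustrates the multivariate CLT with a non-Gaussian vector (the construction $X = (U, U + V)$ from centered exponentials is an illustrative choice; its covariance is $\begin{pmatrix}1 & 1 \\ 1 & 2\end{pmatrix}$):

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 200, 10_000

# Non-Gaussian i.i.d. vectors X = (U, U + V) with U, V ~ Exp(1) - 1,
# so that E[X] = 0 and Cov(X) = [[1, 1], [1, 2]].
u = rng.exponential(size=(reps, n)) - 1.0
v = rng.exponential(size=(reps, n)) - 1.0

# One row per replication: sqrt(n) * (Xbar_n - mu), with mu = 0 here.
z = np.sqrt(n) * np.stack([u.mean(axis=1), (u + v).mean(axis=1)], axis=1)

print(np.cov(z.T))   # ≈ [[1, 1], [1, 2]], as the multivariate CLT predicts
```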

Multivariate Delta method

Let $(T_n)_{n \geq 1}$ be a sequence of random vectors in $\mathbb{R}^d$ such that:

$$\sqrt{n}(T_n - \theta) \xrightarrow[n \to \infty]{(d)} \mathcal{N}_d(0, \Sigma)$$

for some $\theta \in \mathbb{R}^d$ and some covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$.

Let $g: \mathbb{R}^d \to \mathbb{R}^k$ ($k \geq 1$) be continuously differentiable at $\theta$. Then:

$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow[n \to \infty]{(d)} \mathcal{N}_k(0, \nabla g(\theta)^T \Sigma \nabla g(\theta))$$

where:

$$\nabla g(\theta) = \frac{\partial g}{\partial \theta}(\theta) = \left( \frac{\partial g_j}{\partial \theta_i} \right)_{\substack{1 \leq i \leq d \\ 1 \leq j \leq k}} \in \mathbb{R}^{d \times k}$$

Fisher Information

Define the log-likelihood for one observation as:

$$\ell(\theta) = \ln L_1(X, \theta), \quad \theta \in \Theta \subset \mathbb{R}^d$$

Assume that $\ell$ is a.s. twice differentiable. Under some regularity conditions, the Fisher information of the statistical model is defined as:

$$\mathcal{I}(\theta) = \textsf{Cov}(\nabla\ell(\theta)) = -\mathbb{E}[\mathbf{H}\ell(\theta)]$$

where $\nabla\ell(\theta)$ is the gradient and $\mathbf{H}\ell(\theta)$ the Hessian of the log-likelihood.

If $\Theta \subset \mathbb{R}$, we get:

$$\mathcal{I}(\theta) = \textsf{Var}[\ell'(\theta)] = -\mathbb{E}[\ell''(\theta)]$$

For example, let $X \sim \textsf{Ber}(p)$:

$$\begin{aligned} \ell(p) &= X \ln(p) + (1 - X)\ln(1 - p) \\ \ell'(p) &= \frac{X}{p} - \frac{1 - X}{1 - p}, \qquad \textsf{Var}[\ell'(p)] = \frac{1}{p(1 - p)} \\ \ell''(p) &= -\frac{X}{p^2} - \frac{1 - X}{(1 - p)^2}, \qquad -\mathbb{E}[\ell''(p)] = \frac{1}{p(1 - p)} \end{aligned}$$

So the Fisher information of the Bernoulli distribution is $\mathcal{I}(p) = \frac{1}{p(1 - p)}$.
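Both characterizations of the Fisher information can be checked by simulation for the Bernoulli model ($p = 0.3$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)

score = x / p - (1 - x) / (1 - p)             # l'(p), evaluated at the true p
curv = x / p ** 2 + (1 - x) / (1 - p) ** 2    # -l''(p)

print(score.var(), curv.mean())   # both ≈ 1 / (p (1 - p))
print(1 / (p * (1 - p)))          # ≈ 4.7619
```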

The Fisher information of common distributions can easily be found on Wikipedia, e.g. wiki/Bernoulli_distribution.

Asymptotic normality of the MLE

Let $\theta^* \in \Theta$ be the true parameter. Assume the following:

  1. The parameter is identifiable.
  2. For all $\theta \in \Theta$, the support of $\mathbf{P}_\theta$ does not depend on $\theta$;
  3. $\theta^*$ is not on the boundary of $\Theta$;
  4. $\mathcal{I}(\theta)$ is invertible in a neighborhood of $\theta^*$;
  5. A few more technical conditions.

Then, $\hat\theta_n^{\textsf{MLE}}$ satisfies:

  • $\hat\theta_n^{\textsf{MLE}} \xrightarrow[n\to\infty]{\mathbf{P}} \theta^*$, w.r.t. $\mathbf{P}_{\theta^*}$;

  • $\sqrt{n}(\hat\theta_n^{\textsf{MLE}} - \theta^*) \xrightarrow[n\to\infty]{(d)} \mathcal{N}_d(0, \mathcal{I}(\theta^*)^{-1})$, w.r.t. $\mathbf{P}_{\theta^*}$.

The Fisher information $\mathcal{I}(\theta^*)$ at the true parameter determines the asymptotic variance of $\hat\theta_n^{\textsf{MLE}}$.

The method of moments

Moments

Let $X_1, \ldots, X_n$ be an i.i.d. sample associated with a statistical model $(E, (\mathbf{P}_\theta)_{\theta \in \Theta})$. Assume that $E \subset \mathbb{R}$ and $\Theta \subset \mathbb{R}^d$ for some $d \geq 1$.

Population moments: let $m_k(\theta) = \mathbb{E}_\theta[X_1^k]$, $1 \leq k \leq d$.

Empirical moments: let $\hat{m}_k = \overline{X_n^k} = \frac{1}{n} \sum_{i=1}^n X_i^k$ (these depend on the sample, not on $\theta$).

The $k$-th moment is the mean (expectation) of $X^k$.

From the LLN,

$$\hat{m}_k \xrightarrow[n\to\infty]{\mathbf{P}/a.s.} m_k(\theta)$$

More compactly, we say that the whole vector converges:

$$(\hat{m}_1, \ldots, \hat{m}_d) \xrightarrow[n\to\infty]{\mathbf{P}/a.s.} (m_1, \ldots, m_d)$$

Moments estimator

Let:

$$\begin{aligned} M: \Theta &\to \mathbb{R}^d \\ \theta &\mapsto M(\theta) = (m_1(\theta), \ldots, m_d(\theta)) \end{aligned}$$

Assume $M$ is one-to-one, so that:

$$\theta = M^{-1}(m_1(\theta), \ldots, m_d(\theta))$$

The moments estimator of $\theta$ is then defined as:

$$\hat\theta_n^{\textsf{MM}} = M^{-1}(\hat{m}_1, \ldots, \hat{m}_d)$$

provided it exists.

For example, let $(\mathbb{R}, \{\mathcal{N}(\mu, \sigma^2)\}_{\mu \in \mathbb{R}, \sigma > 0})$ be the statistical model of a normal random variable $X$. Let

$$m_k(\mu, \sigma) = \mathbb{E}[X^k]$$

Then: $m_1(\mu, \sigma) = \mu$, $\quad m_2(\mu, \sigma) = \mu^2 + \sigma^2$.

Mapping parameters to moments, let:

$$\begin{aligned} \psi: \mathbb{R} \times (0, \infty) &\to \mathbb{R}^2 \\ (\mu, \sigma) &\mapsto (m_1(\mu, \sigma), m_2(\mu, \sigma)) \end{aligned}$$

$\psi$ is one-to-one on the given domain $\mathbb{R} \times (0, \infty)$; writing $\psi(\mu, \sigma) = (m_1, m_2)$ and inverting:

$$\begin{aligned} \mu &= m_1 \\ \sigma &= \sqrt{m_2 - m_1^2} \end{aligned}$$
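The resulting method-of-moments recipe for the Gaussian model can be sketched in a few lines (the true parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)

m1_hat = x.mean()             # estimates m1 = mu
m2_hat = (x ** 2).mean()      # estimates m2 = mu^2 + sigma^2

# Invert the moment map psi:
mu_hat = m1_hat
sigma_hat = np.sqrt(m2_hat - m1_hat ** 2)

print(mu_hat, sigma_hat)      # close to the true (2.0, 3.0)
```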

Generalized method of moments

Under some technical conditions, the method of moments estimator $\hat\theta_n^{\textsf{MM}}$ is asymptotically normal. Applying the multivariate CLT and the Delta method yields:

$$\sqrt{n}(\hat\theta_n^{\textsf{MM}} - \theta^*) \xrightarrow[n \to \infty]{(d)} \mathcal{N}(0, \Gamma(\theta^*))$$

The quantity $\Gamma(\theta)$ above is referred to as the asymptotic variance:

$$\Gamma(\theta) = \left[ \frac{\partial M^{-1}}{\partial \theta}(M(\theta)) \right]^T \Sigma(\theta) \left[ \frac{\partial M^{-1}}{\partial \theta}(M(\theta)) \right]$$

where $\Sigma(\theta)$ is the covariance matrix of the moment vector $(X_1, X_1^2, \ldots, X_1^d)$.

MLE vs. Moment estimator

  • Comparison of the quadratic risks: in general, the MLE is more accurate.
  • The MLE still gives good results if the model is misspecified.
  • Computational issues: sometimes the MLE is intractable, while the MM estimator is easier to compute (it reduces to solving polynomial equations).

M-estimation

M-estimation involves estimating some parameter of interest related to the underlying, unknown distribution (e.g. its mean, variance, or quantiles). Unlike maximum likelihood estimation and the method of moments, no statistical model needs to be assumed to perform M-estimation. M-estimation can be used in both a parametric and non-parametric context.

The definition of M-estimation:

Let $X_1, \ldots, X_n$ be i.i.d. with some unknown distribution $\mathbf{P}$ and an associated parameter $\mu^*$ on a sample space $E$. We make no modeling assumption that $\mathbf{P}$ is from any particular family of distributions.

An M-estimator of the parameter $\mu^*$ is the argmin of an estimator of a function $\mathcal{Q}(\mu)$ of the parameter which satisfies the following:

  • $\mathcal{Q}(\mu) = \mathbb{E}[\rho(X, \mu)]$ for some function $\rho: E \times \mathcal{M} \to \mathbb{R}$, where $\mathcal{M}$ is the set of all possible values of the unknown true parameter $\mu^*$;
  • $\mathcal{Q}(\mu)$ attains a unique minimum at $\mu = \mu^*$ in $\mathcal{M}$. That is, $\operatorname{argmin}_{\mu \in \mathcal{M}} \mathcal{Q}(\mu) = \mu^*$.

In general, the goal is to find a loss function $\rho$ such that $\mathcal{Q}(\mu) = \mathbb{E}[\rho(X, \mu)]$ has the properties stated above.

Note that the function $\rho(X, \mu)$ is in particular a function of the random variable $X$, and the expectation in $\mathbb{E}[\rho(X, \mu)]$ is taken against the true distribution $\mathbf{P}$ of $X$, with associated parameter value $\mu^*$.

Because $\mathcal{Q}(\mu)$ is an expectation, we can construct a (consistent) estimator of $\mathcal{Q}(\mu)$ by replacing the expectation in its definition with the sample mean.

Maximum likelihood estimation is a special case of M-estimation: in the MLE case, the loss function is $\rho(X_i, \theta) = -\ln p_\theta(X_i)$.

Mean as a Minimizer

In the 1-d case, let $E \subset \mathbb{R}$ and $\mathcal{M} \subset \mathbb{R}$. If $\rho(X, \mu) = (X - \mu)^2$, then $\mu^*$ is the mean of $X$, i.e. $\mathbb{E}[X]$.

Proof (assuming $X$ is continuous with density $f$):

$$\begin{aligned} \mathcal{Q}(\mu) = \mathbb{E}[\rho(X, \mu)] &= \mathbb{E}[(X - \mu)^2] \\ &= \int_{-\infty}^\infty (x - \mu)^2 f(x) \, dx \\ &= \int_{-\infty}^\infty (x^2 - 2\mu x + \mu^2) f(x) \, dx \\ \mathcal{Q}'(\mu) = \frac{d}{d\mu} \mathcal{Q}(\mu) &= \frac{d}{d\mu} \left( \int_{-\infty}^\infty (x^2 - 2\mu x + \mu^2) f(x) \, dx \right) \\ &= -2 \int_{-\infty}^\infty x f(x) \, dx + 2\mu \int_{-\infty}^\infty f(x) \, dx \\ &= -2\mathbb{E}[X] + 2\mu \end{aligned}$$

Setting $\mathcal{Q}'(\mu) = 0$ gives $\mu^* = \mathbb{E}[X]$.

It also works when $E \subset \mathbb{R}^d$ and $\mathcal{M} \subset \mathbb{R}^d$: let $\rho(X, \mu) = \Vert X - \mu \Vert_2^2$; then $\mu^* = \mathbb{E}[X] \in \mathbb{R}^d$.

Median as a Minimizer

In the 1-d case, let $E \subset \mathbb{R}$ and $\mathcal{M} \subset \mathbb{R}$. If $\rho(X, \mu) = \vert X - \mu \vert$, then $\mu^*$ is the median of $X$.

Proof (assuming $X$ is continuous with density $f$):

$$\begin{aligned} \mathcal{Q}(\mu) = \mathbb{E}[\vert X - \mu \vert] &= \int_{-\infty}^\infty \vert x - \mu \vert f(x) \, dx \\ &= \int_\mu^\infty (x - \mu) f(x) \, dx + \int_{-\infty}^\mu (\mu - x) f(x) \, dx \\ \mathcal{Q}'(\mu) = \frac{d}{d\mu} \mathcal{Q}(\mu) &= -\int_\mu^\infty f(x) \, dx + \int_{-\infty}^\mu f(x) \, dx \end{aligned}$$

(When differentiating the limits of integration, the boundary terms vanish because both integrands are $0$ at $x = \mu$.)

Setting $\mathcal{Q}'(\mu) = 0$ gives $\int_{-\infty}^\mu f(x) \, dx = \int_\mu^\infty f(x) \, dx$, so by the definition of the median, $\mu^*$ is the median of $X$.
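A numeric check that the median minimizes the empirical counterpart of $\mathcal{Q}(\mu)$ (the exponential sample is an arbitrary skewed choice; for an odd sample size, the sample median is an exact minimizer of the empirical absolute loss):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=100_001)   # odd size: exact sample median

def q_hat(mu):
    # Empirical Q(mu) = mean of |x_i - mu|
    return np.mean(np.abs(x - mu))

med = np.median(x)   # ≈ ln 2 ≈ 0.693 for Exp(1)

# The sample median beats every other candidate location:
assert all(q_hat(med) <= q_hat(m) for m in [0.3, 0.6, 0.8, 1.0])
print(med)
```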

Quantile as a Minimizer

The check function is defined as:

$$C_\alpha(x) = \begin{cases} -(1 - \alpha)x & \text{if } x < 0 \\ \alpha x & \text{if } x \geq 0 \end{cases}$$

Assume that $X$ is a continuous random variable with density $f: \mathbb{R} \to \mathbb{R}$. Define the $\alpha$-quantile of $X$ to be the value $Q_X(\alpha) \in \mathbb{R}$ such that

$$\mathbf{P}(X \leq Q_X(\alpha)) = \alpha$$

(This differs from the quantile notation used for a standard normal distribution, where $q_\alpha$ is such that $\mathbf{P}(X > q_\alpha) = \alpha$.)

We now prove that the $\alpha$-quantile of $X$ satisfies $Q_X(\alpha) = \operatorname{argmin}_{\mu \in \mathbb{R}} \mathbb{E}[C_\alpha(X - \mu)]$.

Proof:

$$\begin{aligned} \mathcal{Q}(\mu) = \mathbb{E}[C_\alpha(X - \mu)] &= \int_{-\infty}^\infty C_\alpha(x - \mu) f(x) \, dx \\ &= \int_\mu^\infty \alpha(x - \mu) f(x) \, dx - \int_{-\infty}^\mu (1 - \alpha)(x - \mu) f(x) \, dx \\ &= \alpha \int_\mu^\infty x f(x) \, dx - \alpha\mu \int_\mu^\infty f(x) \, dx \\ &\quad - (1 - \alpha) \int_{-\infty}^\mu x f(x) \, dx + (1 - \alpha)\mu \int_{-\infty}^\mu f(x) \, dx \\ \mathcal{Q}'(\mu) &= -\alpha\mu f(\mu) - \alpha \int_\mu^\infty f(x) \, dx + \alpha\mu f(\mu) \\ &\quad - (1 - \alpha)\mu f(\mu) + (1 - \alpha) \int_{-\infty}^\mu f(x) \, dx + (1 - \alpha)\mu f(\mu) \\ &= \int_{-\infty}^\mu f(x) \, dx - \alpha \left( \int_{-\infty}^\mu f(x) \, dx + \int_\mu^\infty f(x) \, dx \right) \\ &= \int_{-\infty}^\mu f(x) \, dx - \alpha \end{aligned}$$

Setting $\mathcal{Q}'(\mu) = 0$ gives $\alpha = \int_{-\infty}^{\mu^*} f(x) \, dx$, so by definition $\mu^*$ is the $\alpha$-quantile of $X$.
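A numeric check that minimizing the check loss recovers the $\alpha$-quantile ($\alpha = 0.9$, the uniform sample, and the search grid are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(size=100_000)
alpha = 0.9

def check_loss(mu):
    # Empirical mean of C_alpha(x_i - mu)
    r = x - mu
    return np.mean(np.where(r >= 0, alpha * r, -(1 - alpha) * r))

grid = np.linspace(0.0, 1.0, 1001)
mu_star = grid[np.argmin([check_loss(m) for m in grid])]

print(mu_star)   # ≈ 0.9, the 0.9-quantile of Uniform(0, 1)
```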

Asymptotic Normality of M-estimators

The J\mathbf J and K\mathbf K matrices:

$$\begin{aligned} \mathbf{J} &= \mathbb{E}[\mathbf{H}\rho] = \mathbb{E}\left[ \begin{pmatrix} \frac{\partial^2 \rho}{\partial \mu_1 \partial \mu_1}(\mathbf{X}_1, \vec{\mu}) & \ldots & \frac{\partial^2 \rho}{\partial \mu_1 \partial \mu_d}(\mathbf{X}_1, \vec{\mu}) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 \rho}{\partial \mu_d \partial \mu_1}(\mathbf{X}_1, \vec{\mu}) & \ldots & \frac{\partial^2 \rho}{\partial \mu_d \partial \mu_d}(\mathbf{X}_1, \vec{\mu}) \end{pmatrix} \right] \quad (d \times d) \\ \\ \mathbf{K} &= \textsf{Cov}[\nabla\rho(\mathbf{X}_1, \vec{\mu})] = \textsf{Cov}\left[ \begin{pmatrix} \frac{\partial \rho}{\partial \mu_1}(\mathbf{X}_1, \vec{\mu}) \\ \vdots \\ \frac{\partial \rho}{\partial \mu_d}(\mathbf{X}_1, \vec{\mu}) \end{pmatrix} \right] \quad (d \times d) \end{aligned}$$

In one dimension, i.e. d=1d=1, the matrices reduce to the following:

$$\begin{aligned} \mathbf{J} &= \mathbb{E}\left[ \frac{\partial^2 \rho}{\partial \mu^2}(X_1, \mu) \right] \\ \mathbf{K} &= \textsf{Var}\left[ \frac{\partial \rho}{\partial \mu}(X_1, \mu) \right] \end{aligned}$$

In the log-likelihood case (writing $\mu = \theta$), both of these functions are equal to the Fisher information:

$$\mathbf{J}(\theta) = \mathbf{K}(\theta) = \mathcal{I}(\theta)$$

Be careful that in the MLE case the loss function is $\rho(X_i, \theta) = -\ln p_\theta(X_i)$, the negative of the log-likelihood: with MLE we maximize the objective function, whereas with M-estimation we minimize it.

Under some technical conditions, the functions $\mathbf{J}(\mu)$ and $\mathbf{K}(\mu)$ determine the asymptotic variance of the M-estimator $\hat\mu_n$:

$$\begin{aligned} \hat\mu_n &\xrightarrow[n \to \infty]{\mathbf{P}} \mu^* \\ \sqrt{n}(\hat\mu_n - \mu^*) &\xrightarrow[n \to \infty]{(d)} \mathcal{N}(0, \mathbf{J}(\mu^*)^{-1} \mathbf{K}(\mu^*) \mathbf{J}(\mu^*)^{-1}) \end{aligned}$$

M-estimators in robust statistics

Estimators that are more resilient to corruptions or mistakes in the data than others are referred to as robust.

The empirical median is more robust than the empirical mean. However, the median estimator uses the absolute value as its loss function, $\rho(X, \mu) = \vert X - \mu \vert$, which is not differentiable at $X = \mu$, so $\mathbf{J}$ and $\mathbf{K}$ cannot be obtained directly. To bypass this problem, we can use Huber's loss, which is defined as:

$$h_\delta(x) = \begin{cases} \frac{x^2}{2} & \text{if } \vert x \vert \leq \delta \\ \delta(\vert x \vert - \delta/2) & \text{if } \vert x \vert > \delta \end{cases}$$

This Huber’s loss is differentiable everywhere.