January 3, 2020
Unit 5 - Bayesian statistics
Notes on Bayesian inference, prior and posterior distributions, Bayes estimators, and Jeffreys' prior from MITx Fundamentals of Statistics
An alternative to the frequentist approach is the Bayesian approach. In a sense, Bayesian inference amounts to having a likelihood function $L_n(\theta)$ that is weighted by prior knowledge about what $\theta$ might be.
Compare the frequentist approach and the Bayesian approach
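The frequentist approach:
- Assume a statistical model $(E, \{\mathbb{P}_\theta\}_{\theta \in \Theta})$.
- We assumed that the data $X_1, \ldots, X_n$ was drawn i.i.d. from $\mathbb{P}_{\theta^*}$ for some unknown fixed $\theta^*$.
- When we used the MLE, for example, we looked at all possible $\theta \in \Theta$.
- Before seeing the data, we did not prefer one choice of $\theta \in \Theta$ over another.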
The Bayesian approach:
- In many practical contexts, we have a prior belief about $\theta^*$.
- Using the data, we want to update that belief and transform it into a posterior belief.
In Bayesian statistics, the true parameter is modeled as a random variable, or at the very least, the uncertainty regarding the true parameter is modeled as such. The Bayesian approach gives statisticians some freedom to reflect prior belief.
Prior and posterior
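- Consider a probability distribution on a parameter space $\Theta$ with some pdf $\pi(\cdot)$: the prior distribution.
- Let $X_1, \ldots, X_n$ be a sample of $n$ random variables.
- Denote by $L_n(\cdot \mid \theta)$ the joint pdf of $X_1, \ldots, X_n$ conditionally on $\theta$, where $\theta \sim \pi$.
- Remark: $L_n(X_1, \ldots, X_n \mid \theta)$ is the likelihood used in the frequentist approach.
- The conditional distribution of $\theta$ given $X_1, \ldots, X_n$ is called the posterior distribution. Denote by $\pi(\cdot \mid X_1, \ldots, X_n)$ its pdf.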
Bayes’ formula
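Bayes’ formula states that:
$$\pi(\theta \mid X_1, \ldots, X_n) \propto \pi(\theta)\, L_n(X_1, \ldots, X_n \mid \theta), \quad \forall \theta \in \Theta.$$
The constant of proportionality does not depend on $\theta$:
$$\pi(\theta \mid X_1, \ldots, X_n) = \frac{\pi(\theta)\, L_n(X_1, \ldots, X_n \mid \theta)}{\int_\Theta \pi(t)\, L_n(X_1, \ldots, X_n \mid t)\, dt}.$$
Bayes’ theorem is stated mathematically as the following equation:
$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A)\, \mathbb{P}(A)}{\mathbb{P}(B)},$$
where $A$ and $B$ are events and $\mathbb{P}(B) \neq 0$.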
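As a rough numerical illustration of Bayes' formula, the sketch below approximates a posterior on a grid: the prior and the likelihood are multiplied pointwise, and the result is divided by a Riemann-sum estimate of the normalizing constant. The function name `grid_posterior`, the grid size, and the coin-flip data are illustrative assumptions, not part of the course material.

```python
def grid_posterior(prior, likelihood, grid):
    """Approximate the posterior density on an evenly spaced grid."""
    step = grid[1] - grid[0]
    # Numerator of Bayes' formula: prior times likelihood at each grid point.
    unnormalized = [prior(t) * likelihood(t) for t in grid]
    # Riemann-sum approximation of the normalizing constant.
    evidence = sum(unnormalized) * step
    return [u / evidence for u in unnormalized]


# Hypothetical data: 7 heads in 10 coin flips, with a uniform prior on (0, 1).
heads, n = 7, 10
grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoints of 1000 cells
posterior = grid_posterior(
    prior=lambda p: 1.0,
    likelihood=lambda p: p**heads * (1 - p) ** (n - heads),
    grid=grid,
)
total_mass = sum(posterior) * (grid[1] - grid[0])
mode = grid[posterior.index(max(posterior))]
```

By construction `total_mass` is 1 up to rounding, and with a flat prior `mode` lands on the grid point closest to the observed frequency of heads.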
Example: Bernoulli Experiment with the Beta Prior
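Take flipping a coin as an example. We select a prior $p \sim \mathrm{Beta}(a, a)$ for some $a > 0$:
$$\pi(p) \propto p^{a-1}(1-p)^{a-1}, \quad p \in (0, 1).$$
Given $p$, $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p)$, so
$$L_n(X_1, \ldots, X_n \mid p) = p^{\sum_{i=1}^n X_i}(1-p)^{n - \sum_{i=1}^n X_i}.$$
Hence the posterior:
$$\pi(p \mid X_1, \ldots, X_n) \propto p^{a - 1 + \sum_{i=1}^n X_i}(1-p)^{a - 1 + n - \sum_{i=1}^n X_i}.$$
The posterior distribution is:
$$\mathrm{Beta}\left(a + \sum_{i=1}^n X_i,\; a + n - \sum_{i=1}^n X_i\right).$$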
In this example, the posterior distribution is also a Beta distribution, just like the prior distribution. A prior with this property, whose posterior stays in the same family, is called a conjugate prior.
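Conjugacy means the update reduces to bookkeeping on two numbers. A minimal sketch, assuming a general $\mathrm{Beta}(a, b)$ prior (the notes use the symmetric case $b = a$) and 0/1 observations; the function name `beta_bernoulli_update` and the flips are hypothetical:

```python
def beta_bernoulli_update(a, b, observations):
    """Return the Beta posterior parameters after observing 0/1 outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    return a + successes, b + failures


# Hypothetical coin flips (1 = heads), starting from a Beta(2, 2) prior.
flips = [1, 0, 1, 1, 0, 1, 1, 1]
post_a, post_b = beta_bernoulli_update(2, 2, flips)

# Updating in two batches gives the same posterior as updating all at once.
mid_a, mid_b = beta_bernoulli_update(2, 2, flips[:3])
seq_a, seq_b = beta_bernoulli_update(mid_a, mid_b, flips[3:])
```

Here six heads and two tails turn the $\mathrm{Beta}(2, 2)$ prior into a $\mathrm{Beta}(8, 4)$ posterior, whichever way the data is batched.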
Prior Distribution
The prior distribution is to be specified by the researcher in order to take into account previous knowledge about possible values of the parameter.
When applying the Bayesian framework, we have considerable freedom in specifying the family of our prior distribution. We must consider the following factors in deciding on our prior:
- Whether or not we could specify the parameters of the distribution so that its shape approximates our prior belief
- Whether or not the support of the distribution is realistic based on our context
- How tractable it would be to compute the posterior distribution and perform inference from it, given the form of the likelihood function
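Non-informative priors
We can still use a Bayesian approach if we have no prior information about the parameter. Good candidate: $\pi(\theta) \propto 1$, i.e., a constant pdf on $\Theta$.
- If $\Theta$ is bounded, this is the uniform prior on $\Theta$.
- If $\Theta$ is unbounded, this does not define a proper pdf on $\Theta$.
An improper prior on $\Theta$ is a measurable, nonnegative function $\pi(\cdot)$ defined on $\Theta$ that is not integrable. In general, one can still define a posterior distribution from an improper prior by using Bayes’ formula.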
A uniform prior reflects an equal belief in each possible hypothesis. With such a prior the posterior is proportional to the likelihood, so the maximum a posteriori and maximum likelihood estimators always coincide.
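The agreement between the two estimators under a flat prior can be checked in the coin-flip setting. This is a sketch under stated assumptions: Bernoulli data, a $\mathrm{Beta}(a, b)$ prior (with $\mathrm{Beta}(1, 1)$ being the uniform prior), and the standard formula for the mode of a Beta distribution, which requires both posterior parameters to exceed 1. The function names are hypothetical.

```python
def bernoulli_mle(successes, n):
    """Maximum likelihood estimate of p: the sample proportion."""
    return successes / n


def beta_map(a, b, successes, n):
    """Mode of the Beta(a + successes, b + n - successes) posterior.

    Valid only when both posterior parameters are greater than 1.
    """
    return (a + successes - 1) / (a + b + n - 2)


# Hypothetical data: 7 heads in 10 flips.
mle = bernoulli_mle(7, 10)
map_uniform = beta_map(1, 1, 7, 10)      # uniform prior: same as the MLE
map_informative = beta_map(5, 5, 7, 10)  # Beta(5, 5) prior pulls toward 1/2
```

With the uniform prior the MAP equals the sample proportion, while the $\mathrm{Beta}(5, 5)$ prior moves it to $11/18$, between the prior mode $1/2$ and the MLE.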
Beta Distribution as priors
The Beta distribution is well suited to models where the parameter represents a probability, because its support is $[0, 1]$.
Jeffreys’ Prior
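Jeffreys’ prior is an attempt to incorporate frequentist ideas of likelihood in the Bayesian framework, as well as an example of a non-informative prior. This prior depends on the statistical model used for the observation data and the likelihood function. Mathematically, it is the prior
$$\pi_J(\theta) \propto \sqrt{\det I(\theta)},$$
where $I(\theta)$ is the Fisher information matrix of the statistical model associated with $X_1, \ldots, X_n$ in the frequentist approach (provided it exists).
In the one-variable case, Jeffreys’ prior reduces to:
$$\pi_J(\theta) \propto \sqrt{I(\theta)}.$$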
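The Fisher information matrix $I(\theta)$ controls the precision of the MLE: under the usual regularity conditions, the asymptotic covariance of the MLE is $I(\theta)^{-1}$.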
Let our parameter of interest be $\theta$.
- Jeffreys’ prior gives more weight to values of $\theta$ whose MLE has less uncertainty.
- As a result, Jeffreys’ prior gives more weight to values of $\theta$ where the data carries more information for deciding the parameter.
- The Fisher information can be taken as a proxy for how much, at a particular parameter value $\theta$, small shifts in the parameter would influence the data. Thus, Jeffreys’ prior gives more weight to regions where the potential outcomes are more sensitive to slight changes in $\theta$.
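As a worked example of the one-variable case, where Jeffreys’ prior is proportional to the square root of the Fisher information, consider a single Bernoulli observation $X \sim \mathrm{Ber}(p)$ with $p \in (0, 1)$. This is a standard computation, included here as an illustration rather than taken from the notes:

$$
\begin{aligned}
\ell(p) &= X \log p + (1 - X) \log(1 - p), \\
\ell''(p) &= -\frac{X}{p^2} - \frac{1 - X}{(1 - p)^2}, \\
I(p) &= -\mathbb{E}\left[\ell''(p)\right] = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)}, \\
\pi_J(p) &\propto \sqrt{I(p)} = p^{-1/2}(1 - p)^{-1/2}.
\end{aligned}
$$

This is the kernel of a $\mathrm{Beta}(1/2, 1/2)$ distribution. It puts more mass near $p = 0$ and $p = 1$, where a Bernoulli outcome is most sensitive to small changes in $p$.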
Bayesian confidence regions
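For $\alpha \in (0, 1)$, a Bayesian confidence region with level $\alpha$ is a random subset $\mathcal{R}$ of the parameter space $\Theta$, which depends on the sample $X_1, \ldots, X_n$, such that:
$$\mathbb{P}[\theta \in \mathcal{R} \mid X_1, \ldots, X_n] = 1 - \alpha.$$
Note that $\mathcal{R}$ depends on the prior $\pi(\cdot)$.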
“Bayesian confidence region” and “confidence interval” are two distinct notions.
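One way to build such a region in practice is an equal-tailed interval that cuts $\alpha/2$ of posterior mass from each side. The sketch below does this on a grid for a Beta posterior. The equal-tailed choice, the function name `equal_tailed_region`, and the grid resolution are assumptions made for illustration; other regions with the same posterior mass are equally valid.

```python
def equal_tailed_region(a, b, alpha, cells=20_000):
    """Approximate interval holding 1 - alpha of the Beta(a, b) posterior mass."""
    step = 1.0 / cells
    mids = [(i + 0.5) * step for i in range(cells)]
    # Unnormalized Beta density at each cell midpoint.
    dens = [p ** (a - 1) * (1 - p) ** (b - 1) for p in mids]
    total = sum(dens)
    lower = upper = None
    cumulative = 0.0
    for p, d in zip(mids, dens):
        cumulative += d / total  # running approximation of the posterior CDF
        if lower is None and cumulative >= alpha / 2:
            lower = p
        if cumulative >= 1 - alpha / 2:
            upper = p
            break
    return lower, upper


# Hypothetical posterior Beta(8, 4), region with level alpha = 0.05.
region = equal_tailed_region(8, 4, 0.05)
```

The interval is computed from the posterior alone, so changing the prior changes the region even when the data stays the same.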
Bayesian estimation
The Bayesian framework can also be used to estimate the true underlying parameter (hence, in a frequentist approach). In this case, the prior distribution does not reflect a prior belief: It is just an artificial tool used in order to define a new class of estimators.
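Back to the frequentist approach: The sample $X_1, \ldots, X_n$ is associated with a statistical model $(E, \{\mathbb{P}_\theta\}_{\theta \in \Theta})$.
- Define a prior (that can be improper) with pdf $\pi$ on the parameter space $\Theta$.
- Compute the posterior pdf $\pi(\cdot \mid X_1, \ldots, X_n)$ associated with $\pi$.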
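Bayes estimator:
$$\hat{\theta}^{(\pi)} = \int_\Theta \theta\, \pi(\theta \mid X_1, \ldots, X_n)\, d\theta.$$
This is the posterior mean. The Bayes estimator depends on the choice of the prior distribution $\pi$ (hence the superscript $\pi$).
Another popular choice is the point that maximizes the posterior distribution, provided it is unique. It is called the MAP (maximum a posteriori):
$$\hat{\theta}^{\mathrm{MAP}} = \operatorname*{argmax}_{\theta \in \Theta} \pi(\theta \mid X_1, \ldots, X_n) = \operatorname*{argmax}_{\theta \in \Theta} \pi(\theta)\, L_n(X_1, \ldots, X_n \mid \theta).$$
In the previous example, with prior $\mathrm{Beta}(a, a)$:
$$\hat{p}^{(\pi)} = \frac{a + \sum_{i=1}^n X_i}{2a + n} = \frac{a/n + \bar{X}_n}{2a/n + 1}.$$
In particular, for $a = 1/2$ (Jeffreys’ prior):
$$\hat{p}^{(\pi_J)} = \frac{1/(2n) + \bar{X}_n}{1/n + 1}.$$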
In this example, the Bayes estimator is consistent and asymptotically normal.
In general, the asymptotic properties of the Bayes estimator do not depend on the choice of the prior.
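The fading influence of the prior can be seen numerically in the coin-flip example. The sketch below uses the posterior mean under a $\mathrm{Beta}(a, a)$ prior, which is $(a + \sum_i X_i)/(2a + n)$, and compares a weak prior ($a = 1/2$) with a strong one ($a = 50$) on data whose sample mean is held at 0.3. The function names, the two values of $a$, and the synthetic data are assumptions chosen for illustration.

```python
def bayes_estimator(a, observations):
    """Posterior mean of p under a Beta(a, a) prior with Bernoulli observations."""
    return (a + sum(observations)) / (2 * a + len(observations))


def prior_gap(n, a_weak=0.5, a_strong=50.0, mean=0.3):
    """Gap between the estimators under two priors, for data with a fixed sample mean."""
    ones = round(mean * n)
    data = [1] * ones + [0] * (n - ones)
    return abs(bayes_estimator(a_strong, data) - bayes_estimator(a_weak, data))


# The gap shrinks as the sample size grows.
gaps = [prior_gap(n) for n in (10, 1_000, 100_000)]
```

Both estimators approach the sample mean as $n$ grows, so the gap between them goes to zero, roughly at rate $1/n$.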