MLE’s Optimality
For a large number of training samples, the Cramér-Rao lower bound (see an explanation at (Ben Lambert, 2014)) shows that no consistent estimator has a lower mean squared error than the maximum likelihood estimator. For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator for machine learning.
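For concreteness, here is the standard statement of the bound (the notation is ours, for a scalar parameter and $n$ i.i.d. samples): any unbiased estimator $\hat{\theta}$ satisfies

$$\operatorname{Var}(\hat{\theta}) \;\geq\; \frac{1}{n\, I(\theta)}, \qquad I(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}\!\left[\left(\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right)^{2}\right],$$

and the MLE attains this bound asymptotically, which is what "efficiency" refers to here.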
Maximum A Posteriori (MAP)
While MLE offers a provably “best fit” estimator of the data, it doesn’t account for any prior beliefs we may have about the distribution parameters. In other words, the estimate depends only on the dataset. This makes MLE prone to overfitting, and incorporating a prior over the parameter is not only the proper Bayesian way to estimate the posterior, but can also lead to a better estimator (cue regularization).
Starting from the Bayes posterior for the parameter $\theta$,

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)},$$

and ignoring $p(x)$ since it is independent of $\theta$, MAP takes $\arg\max_\theta$ over the posterior, which results in:

$$\theta_{\text{MAP}} = \arg\max_\theta p(\theta \mid x) = \arg\max_\theta \big( \log p(x \mid \theta) + \log p(\theta) \big)$$

We recognize, above on the right-hand side, $\log p(x \mid \theta)$, i.e. the standard log-likelihood term, and $\log p(\theta)$, corresponding to the prior distribution.
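As a minimal numerical sketch of this argmax (the unit-variance Gaussian model, the prior parameters, and all names below are our own illustrative choices, not from the text):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)    # observed data; noise std assumed known = 1

mu0, tau = 0.0, 1.0                            # Gaussian prior N(mu0, tau^2) on the mean

def neg_log_posterior(mu):
    # -(log p(x | mu) + log p(mu)), negated because we use a minimizer
    log_lik = norm.logpdf(x, loc=mu, scale=1.0).sum()
    log_prior = norm.logpdf(mu, loc=mu0, scale=tau)
    return -(log_lik + log_prior)

mle = x.mean()                                                 # argmax of the likelihood alone
map_numeric = minimize_scalar(neg_log_posterior).x             # argmax of the posterior
map_closed = (tau**2 * x.sum() + mu0) / (tau**2 * len(x) + 1)  # closed form for this conjugate model

print(mle, map_numeric, map_closed)  # the MAP estimate is shrunk towards the prior mean mu0
```

The numerical and closed-form MAP estimates agree, and both sit between the sample mean (MLE) and the prior mean, illustrating the pull of the $\log p(\theta)$ term.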
MAP and Regularization
Many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference. This view applies when the regularization consists of adding an extra term to the objective function that corresponds to $\log p(\theta)$.
Not all regularization penalties correspond to MAP Bayesian inference. For example, some regularizer terms may not be the logarithm of a probability distribution. Other regularization terms depend on the data, which of course a prior probability distribution is not allowed to do.
As an example, consider a linear regression model with a Gaussian prior on the weights $w$. If this prior is given by $\mathcal{N}(w;\, 0,\ \tfrac{1}{\lambda} I)$, then the log-prior term is proportional to the familiar $\lambda\, w^\top w$ weight decay penalty, plus a term that does not depend on $w$ and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.1
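A short sketch of this correspondence (the data, $\sigma$, $\tau$, and variable names below are illustrative assumptions): with Gaussian noise of variance $\sigma^2$ and prior $\mathcal{N}(0, \tau^2 I)$ on $w$, the MAP normal equations match ridge regression with $\lambda = \sigma^2 / \tau^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.5                                   # noise std, assumed known here
y = X @ w_true + rng.normal(scale=sigma, size=n)

tau = 1.0                                     # prior: w ~ N(0, tau^2 I)
lam = sigma**2 / tau**2                       # equivalent weight-decay coefficient

# Ridge / weight-decay solution: argmin ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP solution: argmax log N(y; Xw, sigma^2 I) + log N(w; 0, tau^2 I)
# Setting the gradient to zero gives the same normal equations, hence the same w.
w_map = np.linalg.solve(X.T @ X + (sigma**2 / tau**2) * np.eye(d), X.T @ y)

print(np.allclose(w_ridge, w_map))  # True: weight decay == MAP with a Gaussian prior
```

The ratio $\sigma^2 / \tau^2$ plays the role of the weight-decay coefficient: a tighter prior (smaller $\tau$) means stronger shrinkage of the weights.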
Prior
In the scenarios where Bayesian estimation is typically used, the prior begins as a relatively uniform or Gaussian distribution with high entropy, and the observation of the data usually causes the posterior to lose entropy and concentrate around a few highly likely values of the parameters.
The prior has an influence by shifting probability mass density towards regions of the parameter space that are preferred a priori. In practice, the prior often expresses a preference for models that are simpler or smoother. Critics of the Bayesian approach identify the prior as a source of subjective human judgment impacting the predictions.
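A small conjugate Beta-Bernoulli sketch (our own illustrative example, not from the text) shows this loss of entropy as data accumulate:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
true_bias = 0.7
flips = rng.random(1000) < true_bias           # simulated Bernoulli data

# Uniform Beta(1, 1) prior on the coin bias: maximal entropy on [0, 1].
for n in (0, 10, 100, 1000):
    heads = int(flips[:n].sum())
    posterior = beta(1 + heads, 1 + (n - heads))   # conjugate Beta posterior
    # Differential entropy shrinks (eventually going negative) as the
    # posterior concentrates around the true bias.
    print(n, round(posterior.entropy(), 3))
```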
MLE is MAP with Uniform Prior
With a uniform prior for the parameter $\theta$ (making all values of $\theta$ equally likely, so that $p(\theta)$ is a constant and the $\log p(\theta)$ term drops out of the $\arg\max_\theta$), the MAP estimate becomes equal to the MLE: $\theta_{\text{MAP}} = \arg\max_\theta \log p(x \mid \theta) = \theta_{\text{MLE}}$.
Footnotes
- See the stackoverflow answer at MAP estimation as regularisation of MLE ↩