Information Theory Background¹

Consider a random variable $X$ that takes on values $x$ from a set $\mathcal{X}$ with a probability distribution $p(x)$. We want to convey in a message to someone which events in $\mathcal{X}$ have occurred, in the sequence they occurred. (Think of the $x$'s as words from a vocabulary and the message as a sentence from the language.) We can simply send the actual $x$'s in whatever encoding (say, 8-bit binary) we like. But let's consider the problem of finding the encoding that leads to the smallest average message length.

First off, we immediately realize that fixed-length codes will be wasteful when the objective is to not waste any bits: we would necessarily have to choose an encoding length capable of covering all the events (the entire support of $p$), but not all events are equally likely, and we'd end up sending even the most likely events in that same long encoding, thereby wasting bits. What if we could use shorter encodings for more frequent events and longer ones for less frequent events? That's called variable-length encoding, and it turns out it's a great idea for optimality.

When using variable-length encodings, in order to decode the message, we need to know where the code for each event ends (since we can't rely on a fixed length). Using stop "words" would again be wasteful. To achieve this without stop "words", we need to make sure that no code is a prefix of another code (such as 10 and 1011), otherwise we wouldn't be able to tell those two codes apart. This is called the prefix property, and such codes are called prefix codes.
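To make the prefix property concrete, here is a minimal sketch (in Python, with a made-up four-word vocabulary and hand-picked codewords) of decoding a bit string with a prefix code; because no codeword is a prefix of another, we can decode greedily without any separators:

```python
# A hypothetical prefix code for a tiny vocabulary: no codeword is a
# prefix of any other, so a concatenated bit string decodes unambiguously.
code = {"the": "0", "cat": "10", "sat": "110", "mat": "111"}
decode_table = {v: k for k, v in code.items()}

def encode(words):
    return "".join(code[w] for w in words)

def decode(bits):
    words, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in decode_table:      # a full codeword has been read
            words.append(decode_table[buffer])
            buffer = ""
    return words

msg = encode(["the", "cat", "sat"])     # "010110"
print(msg, decode(msg))                 # ['the', 'cat', 'sat']
```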

However, with prefix codes, choosing a codeword for an event means sacrificing the rest of the space of codes starting with that codeword. Thus, choosing a shorter codeword for one event forces us to choose longer codes for others, thereby potentially increasing the average code length. This is our cost in terms of the space of possible codewords when assigning codes. If $L$ is the length of a codeword in bits, we sacrifice $\frac{1}{2^L}$ of the space of codewords for this event.

For example, 3 bits can represent 8 numbers or partitions, and by reserving one of those codes, we effectively block $\frac{1}{8}$ of the space for this event, since every other event is now restricted to using the other 7 partitions. So that's our cost in terms of the space of possible codes.

We are interested in finding the optimal encoding for this set of events: one where the average message length is shortest.

It turns out that this optimal encoding occurs when we allot our budget to each word in proportion to its frequency distribution, i.e., more frequent words get shorter codes. In other words, for each $x$, we bear a cost exactly equal to its probability $p(x)$:

$$\frac{1}{2^{L(x)}} = p(x)$$

which gives us the optimal code length $L(x) = \log_2 \frac{1}{p(x)}$ for each word.
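As a quick sanity check of the $L(x) = \log_2 \frac{1}{p(x)}$ rule, here is a small sketch (the word distribution is made up, with probabilities chosen as powers of $\frac{1}{2}$) computing the optimal code lengths:

```python
import math

# A made-up word distribution; probabilities are powers of 1/2 so the
# optimal code lengths come out as whole numbers of bits.
p = {"the": 0.5, "cat": 0.25, "sat": 0.125, "mat": 0.125}

for word, prob in p.items():
    length = math.log2(1 / prob)        # optimal code length in bits
    print(f"{word}: p={prob}, optimal length={length:.0f} bits")
# the: 1 bit, cat: 2 bits, sat: 3 bits, mat: 3 bits -- matching the
# hand-picked prefix code in the earlier decoding sketch.
```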

Entropy

The entropy of a distribution $p$, written $H(p)$, is the expected length of a message in the optimal encoding for $p$:

$$H(p) = \sum_x p(x)\, L(x) = \sum_x p(x) \log_2 \frac{1}{p(x)} = -\sum_x p(x) \log_2 p(x)$$

Put directly, entropy is the least number of bits you need, on average, to communicate events from $p$. It gives us a way of quantifying how much information is contained in a distribution.

Which brings us to uncertainty. How much information does an absolutely certain event convey? None! We knew it was going to happen anyway. How much information does a uniform distribution over a bunch of numbers convey? A lot! We can't predict one event over another since they all have equal probabilities of occurring.

The more diffuse the probability distribution, the more uncertain we are about any of the values, the more we learn on average when we find out what happened, and the longer our messages have to be on average.
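Here's a minimal sketch (same made-up distribution as above) computing the entropy as the expected optimal code length:

```python
import math

p = {"the": 0.5, "cat": 0.25, "sat": 0.125, "mat": 0.125}

# Entropy: expected number of bits under the optimal encoding.
H = sum(prob * math.log2(1 / prob) for prob in p.values())
print(H)    # 1.75 bits, versus 2 bits for a fixed-length code over 4 words
```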

Perplexity

Perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution.

The perplexity of a discrete probability distribution $p$ is defined as

$$\text{Perplexity}(p) = 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}$$
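Continuing the sketch, perplexity is just $2^{H(p)}$; a uniform distribution over $k$ outcomes has perplexity exactly $k$, which is why perplexity is often read as an "effective number of choices":

```python
import math

def entropy(p):
    return sum(q * math.log2(1 / q) for q in p if q > 0)

def perplexity(p):
    return 2 ** entropy(p)

print(perplexity([0.25, 0.25, 0.25, 0.25]))   # 4.0: as hard as a fair 4-way guess
print(perplexity([0.7, 0.1, 0.1, 0.1]))       # ~2.56: easier to guess
print(perplexity([1.0, 0.0, 0.0, 0.0]))       # 1.0: no uncertainty at all
```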

Cross Entropy

Now, suppose we use the optimal encoding for one distribution $q$ to encode events actually occurring according to a different distribution $p$. We define the cross-entropy of $p$ w.r.t. $q$ as the average length of messages from $p$ using the optimal encoding scheme of $q$:

$$H(p, q) = \sum_x p(x) \log_2 \frac{1}{q(x)} = -\sum_x p(x) \log_2 q(x)$$

Note that it is not symmetric: in general, $H(p, q) \neq H(q, p)$.

Cross-entropy gives us a way to think about how different a distribution $q$ is from $p$. $H(p, q)$ is longer than $H(p)$ because the values drawn from $p$ occur with different frequencies than $q$'s encoding expects. Had the two distributions been the same, they would've had the same encoding and thus the cross-entropy would've been equal to the entropy of either of them.

The more different these distributions are, the longer the cross-entropy will be compared to the entropy, i.e., the larger $H(p, q) - H(p)$ will be.
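A small numerical sketch (made-up distributions) of the cost of using the wrong code: the cross-entropy $H(p, q)$ is never smaller than the entropy $H(p)$, and it grows as $q$ drifts away from $p$:

```python
import math

def entropy(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Average bits to send events from p using the optimal code for q.
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.7, 0.1, 0.1, 0.1]          # a mismatched encoding

print(entropy(p))                 # 1.75 bits
print(cross_entropy(p, q))        # ~1.92 bits: the mismatch costs ~0.17 bits/event
print(cross_entropy(q, p))        # 1.50 bits: note H(p, q) != H(q, p)
```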

Kullback–Leibler Divergence

This difference $H(p, q) - H(p)$ therefore gives us a neat way to compare two distributions, and is called the KL divergence of $p$ w.r.t. $q$, written $D_{KL}(p \,\|\, q)$. It is the extra number of bits we need to use to encode events from $p$ in the optimal encoding of $q$:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$

which, if you do the math, is

$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$$

which is also written as the expected value of the log likelihood ratio:

$$D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log_2 \frac{p(x)}{q(x)}\right]$$

KL divergence is again not symmetric: in general $D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p)$. It's not a distance metric (distance metrics are symmetric); it's a divergence measure.
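Continuing the sketch above, the KL divergence is exactly the gap between the cross-entropy and the entropy, and it is not symmetric:

```python
import math

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x))
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.7, 0.1, 0.1, 0.1]

print(kl(p, q))   # ~0.168 bits = H(p, q) - H(p) = 1.918 - 1.750
print(kl(q, p))   # ~0.143 bits = H(q, p) - H(q) = 1.500 - 1.357: not symmetric
```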

A note on distance metrics vs. divergences.

While both distance metrics and divergences indicate how far apart two entities are, they are a bit different. Distance metrics² are:

  • symmetrical,
  • and satisfy the triangle inequality.

Divergences are defined specifically on probability distributions. While all distance metrics between probability distributions are divergences, the converse is not true.

Taking a look at $D_{KL}(p \,\|\, q) = H(p, q) - H(p)$, we see that the cross-entropy and the KL divergence differ by the entropy term $H(p)$. If $p$ is the true data distribution³ and $q_\theta$ is the model distribution that we are trying to match to $p$, then since the entropy of the true data distribution is independent of the model parameters $\theta$, minimizing the cross-entropy $H(p, q_\theta)$ is equivalent to minimizing the KL divergence $D_{KL}(p \,\|\, q_\theta)$. This is what we often do in classification problems.⁴
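In symbols, with $p$ fixed and only $q_\theta$ depending on the parameters, the constant $H(p)$ drops out of the optimization:

$$\arg\min_\theta D_{KL}(p \,\|\, q_\theta) = \arg\min_\theta \left[ H(p, q_\theta) - H(p) \right] = \arg\min_\theta H(p, q_\theta)$$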

Suppose we want to approximate a distribution $p$ using a simpler distribution $q$. We can do this by minimizing $D_{KL}(p \,\|\, q)$ or $D_{KL}(q \,\|\, p)$. The resulting distribution $q$ will have different characteristics based on which KL we choose:

Forward KL, $D_{KL}(p \,\|\, q)$, or Inclusive KL

  • M-projection or moment projection
  • Zero-avoiding or mode-covering
  • Minimizing the forward KL will force $q$ to include all the areas of the space for which $p$ has non-zero probability, and will typically over-estimate the support of $p$.

Reverse KL, $D_{KL}(q \,\|\, p)$, or Exclusive KL

  • I-projection or information projection
  • Zero-forcing or mode-seeking
  • Minimizing the exclusive KL will force $q$ to exclude all the areas of the space for which $p$ has zero probability, and will typically under-estimate the support of $p$ (see the numerical sketch after this list).
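As a concrete illustration, here is a minimal numerical sketch (all details made up: a discretized bimodal target $p$ and a single-Gaussian family for $q$, fit by brute-force grid search) showing the mode-covering vs. mode-seeking behavior:

```python
import numpy as np

xs = np.linspace(-8, 8, 401)

def normalize(w):
    return w / w.sum()

def gaussian(mu, sigma):
    return normalize(np.exp(-0.5 * ((xs - mu) / sigma) ** 2))

# Bimodal target p: two well-separated modes at -3 and +3.
p = normalize(gaussian(-3.0, 0.6) + gaussian(3.0, 0.6))

def kl(a, b, eps=1e-12):
    return np.sum(a * np.log((a + eps) / (b + eps)))

# Brute-force search over single Gaussians q = N(mu, sigma).
candidates = [(mu, sigma) for mu in np.linspace(-5, 5, 101)
                          for sigma in np.linspace(0.3, 5, 48)]

fwd = min(candidates, key=lambda c: kl(p, gaussian(*c)))   # forward: D_KL(p || q)
rev = min(candidates, key=lambda c: kl(gaussian(*c), p))   # reverse: D_KL(q || p)

print("forward KL picks mu=%.1f sigma=%.1f" % fwd)   # mu near 0, wide sigma: covers both modes
print("reverse KL picks mu=%.1f sigma=%.1f" % rev)   # mu near +/-3, small sigma: locks onto one mode
```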

Maximum Likelihood Estimation

Defining Likelihood

In probability theory, the likelihood $L(\theta \mid x)$ of a parameter $\theta$ given some fixed data $x$ is any function of $\theta$ that is proportional to the sampling density $p(x \mid \theta)$ (the density of $x$), i.e.,

$$L(\theta \mid x) \propto p(x \mid \theta)$$

In Bayesian statistics, we usually express Bayes' theorem in its simplest form as:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$

also expressed as:

$$\text{posterior} \propto \text{likelihood} \times \text{prior}$$

Strictly speaking, we do not require that the likelihood be equal to the sampling density; it need only be proportional to it, which allows us to drop multiplicative factors that do not depend on the parameters.
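For a concrete (made-up) example: if the data $x$ is "7 heads out of 10 flips" and $\theta$ is the probability of heads, the sampling density is binomial, $p(x \mid \theta) = \binom{10}{7}\theta^7(1-\theta)^3$, and any function proportional to it in $\theta$, such as $L(\theta \mid x) = \theta^7(1-\theta)^3$ with the constant $\binom{10}{7}$ dropped, is a perfectly good likelihood.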

We know that minimizing the KL divergence is a way to bring two distributions closer. Let $\mathbb{X} = \{x_1, \ldots, x_N\}$ be some data that we assume is distributed according to a true data distribution $p_{\text{data}}$. Let $p_{\text{model}}$ be some parametric distribution with parameters $\theta$ that we are trying to fit to the data we have (since $p_{\text{model}}$ is a function of both $x$ and $\theta$, we can also write the density as $p_{\text{model}}(x; \theta)$; this just means it is a function of $x$ parameterized by $\theta$). We can do this by choosing $\theta$ in a way that maximizes the likelihood of that data, i.e.,

$$\theta_{ML} = \arg\max_\theta \; p_{\text{model}}(\mathbb{X}; \theta)$$

Since we assume the training data is i.i.d., the expression becomes

$$\theta_{ML} = \arg\max_\theta \; \prod_{i=1}^{N} p_{\text{model}}(x_i; \theta)$$

Taking the $\log$ allows us to change the product into a summation, and since $\log$ is an increasing function, it does not affect the $\arg\max$:

$$\theta_{ML} = \arg\max_\theta \; \sum_{i=1}^{N} \log p_{\text{model}}(x_i; \theta)$$
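A minimal sketch (made-up coin-flip data) checking that the $\theta$ maximizing the log likelihood is the intuitive answer, the observed fraction of heads:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=1000)          # 1000 coin flips with true theta = 0.7

thetas = np.linspace(0.01, 0.99, 99)
# Log likelihood of the i.i.d. Bernoulli data for each candidate theta.
loglik = np.array([np.sum(data * np.log(t) + (1 - data) * np.log(1 - t))
                   for t in thetas])

print(thetas[np.argmax(loglik)])    # the grid point nearest the sample mean, ~0.7
print(data.mean())                  # the exact MLE: the observed fraction of heads
```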

Conditional MLE

While here we considered the unconditional distribution of data (such as a normal distribution fit to the heights of sampled males in a country), MLE can just as easily be applied to conditional distributions; this is actually the most common situation, because it forms the basis for most supervised learning.

If $\mathbb{X}$ represents all our inputs and $\mathbb{Y}$ all our observed targets, then the conditional maximum likelihood estimator is:

$$\theta_{ML} = \arg\max_\theta \; p_{\text{model}}(\mathbb{Y} \mid \mathbb{X}; \theta)$$

which in the i.i.d. case becomes:

$$\theta_{ML} = \arg\max_\theta \; \sum_{i=1}^{N} \log p_{\text{model}}(y_i \mid x_i; \theta)$$
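As a small sketch of conditional MLE (all details made up), here is a one-parameter logistic regression fit by gradient ascent on the conditional log likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
true_w = 2.0
y = rng.binomial(1, 1 / (1 + np.exp(-true_w * x)))   # targets drawn from p(y|x; true_w)

def cond_loglik(w):
    p = 1 / (1 + np.exp(-w * x))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Gradient ascent on the conditional log likelihood.
w = 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-w * x))
    grad = np.sum((y - p) * x)          # d/dw of the conditional log likelihood
    w += 0.01 * grad

print(w, cond_loglik(w))                # w recovered close to 2.0
```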

KL Divergence and MLE

Because the $\arg\max$ does not change when we rescale the objective, we can divide the (unconditional) MLE expression by $N$ to get the sample mean (an expectation w.r.t. the empirical distribution $\hat{p}_{\text{data}}$) of the log likelihood:

$$\theta_{ML} = \arg\max_\theta \; \frac{1}{N} \sum_{i=1}^{N} \log p_{\text{model}}(x_i; \theta) = \arg\max_\theta \; \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$$

Then, by the law of large numbers, this sample mean converges to the true expectation as $N \to \infty$, so asymptotically:

$$\theta_{ML} = \arg\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$$

We can also estimate $p_{\text{data}}$ with $p_{\text{model}}$ by choosing $\theta$ in a way that minimizes the KL divergence $D_{KL}(p_{\text{data}} \,\|\, p_{\text{model}})$:

$$\theta^{*} = \arg\min_\theta \; D_{KL}(p_{\text{data}} \,\|\, p_{\text{model}})$$

Using the definition of $D_{KL}$:

$$D_{KL}(p_{\text{data}} \,\|\, p_{\text{model}}) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log \frac{p_{\text{data}}(x)}{p_{\text{model}}(x; \theta)} \right]$$

Simplifying,

$$D_{KL}(p_{\text{data}} \,\|\, p_{\text{model}}) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{data}}(x) \right] - \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$$

Plugging this back into the minimization and noting that the first term is a constant w.r.t. $\theta$:

$$\theta^{*} = \arg\min_\theta \; \left( -\mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right] \right)$$

which, simplifying further (multiplying by $-1$ to take the minus sign out and changing the $\arg\min$ to an $\arg\max$), gives us

$$\theta^{*} = \arg\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$$

which is exactly the asymptotic maximum likelihood objective we derived above, thus proving that maximizing likelihood is asymptotically equivalent to minimizing the KL divergence.

Also note that the non-constant quantity $-\mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$ is the cross-entropy $H(p_{\text{data}}, p_{\text{model}})$. Thus, maximizing likelihood is also equivalent to minimizing the cross-entropy.
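A quick numerical check of the equivalence (reusing the made-up coin-flip setup: $\hat{p}_{\text{data}}$ is the empirical distribution of the flips over $\{0, 1\}$, and $p_{\text{model}}$ is Bernoulli($\theta$)):

```python
import numpy as np

data = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])      # made-up flips: 7 heads, 3 tails
p_hat = np.array([1 - data.mean(), data.mean()])      # empirical distribution over {0, 1}

thetas = np.linspace(0.01, 0.99, 99)

def avg_loglik(t):
    return np.mean(data * np.log(t) + (1 - data) * np.log(1 - t))

def kl_to_model(t):
    q = np.array([1 - t, t])                           # Bernoulli(theta) model
    return np.sum(p_hat * np.log(p_hat / q))

print(thetas[np.argmax([avg_loglik(t) for t in thetas])])   # both pick theta = 0.7:
print(thetas[np.argmin([kl_to_model(t) for t in thetas])])  # same argmax / argmin
```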

Negative Log Likelihood Loss (NLL) a.k.a. Log Loss

The negative of the log likelihood, so that instead of maximizing the likelihood, we can estimate $\theta$ by minimizing its negative like a typical loss function:

$$\text{NLL}(\theta) = -\sum_{i=1}^{N} \log p_{\text{model}}(y_i \mid x_i; \theta)$$
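A minimal sketch of the NLL loss for a classifier (made-up per-example class probabilities and targets): pick out the log probability assigned to each true class and average the negatives:

```python
import numpy as np

# Made-up per-example class log-probabilities (rows) and true class indices.
log_probs = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.3, 0.3, 0.4]]))
targets = np.array([0, 1, 2])

nll = -np.mean(log_probs[np.arange(len(targets)), targets])
print(nll)          # average of -log(0.7), -log(0.8), -log(0.4)
```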

Cross Entropy Loss

Softmax + NLL loss: the model produces unnormalized scores (logits), the softmax turns them into a probability distribution over classes, and the loss is the negative log likelihood of the target class under that distribution.
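Here's a sketch of the cross-entropy loss as softmax followed by NLL, written out in plain numpy with made-up logits (in PyTorch, for example, `torch.nn.functional.cross_entropy` combines a log-softmax with an NLL loss in the same way):

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy_loss(logits, targets):
    # Softmax turns logits into class probabilities; NLL picks out the target ones.
    lp = log_softmax(logits)
    return -np.mean(lp[np.arange(len(targets)), targets])

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])     # made-up unnormalized scores
targets = np.array([0, 1])
print(cross_entropy_loss(logits, targets))
```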

Bibliography

Footnotes

  1. This section is a condensed version of the excellent blog: Visual Information Theory -- Colah’s Blog (2015)

  2. For formal definition of distances, see Metric space - Wikipedia

  3. Technically, while the generating process may generate data with a true data distribution of , the sample distribution may differ slightly, and is called the empirical distribution. Since we don’t have access to the true distribution, this sample distribution is what we try to estimate.

  4. See also on why cross entropy is a better idea sometimes: What is the difference between Cross-entropy and KL divergence?

Visual Information Theory -- colah’s blog. (2015, October 14). https://colah.github.io/posts/2015-09-Visual-Information/