Information Theory Background¹

Consider a random variable $X$ that takes on values $x$ from a set $\mathcal{X}$ with a probability distribution $p(x)$. We want to convey in a message to someone which events in $\mathcal{X}$ have occurred, in the sequence they occurred. (Think of the $x$'s as words from a vocabulary and the message as a sentence from the language.) We can simply send the actual $x$'s in whatever encoding (say, 8-bit binary) we like. But let's consider the problem of finding the encoding that leads to the smallest average message length.

First off, we immediately realize that fixed-length codes will be wasteful when the objective is to not waste any bits: we would necessarily have to choose an encoding length capable of covering all the events (the entire support of $p$), but not all events are equally likely, and we'd end up sending even the most likely events in that same long encoding, thereby wasting bits. What if we could use shorter encodings for more frequent events and longer ones for less frequent events? That's called variable-length encoding, and it turns out it's a great idea for optimality.

When using variable-length encodings, in order to decode the message, we need to know where the code for each event ends (since we can't rely on a fixed length). Using stop "words" would again be wasteful. To achieve this without stop "words", we need to make sure that no code is a prefix of another code (such as 10 and 1011), otherwise we wouldn't be able to tell those two codes apart. This is called the prefix property, and such codes are called prefix codes.
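To make the prefix property concrete, here is a minimal sketch (in Python, with a made-up four-word vocabulary and hand-picked codewords) of decoding a bit string with a prefix code; because no codeword is a prefix of another, we can decode greedily without any separators:

```python
# A hypothetical prefix code for a tiny vocabulary: no codeword is a
# prefix of any other, so a concatenated bit string decodes unambiguously.
code = {"the": "0", "cat": "10", "sat": "110", "mat": "111"}
decode_table = {v: k for k, v in code.items()}

def encode(words):
    return "".join(code[w] for w in words)

def decode(bits):
    words, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in decode_table:      # a full codeword has been read
            words.append(decode_table[buffer])
            buffer = ""
    return words

msg = encode(["the", "cat", "sat"])     # "010110"
print(msg, decode(msg))                 # ['the', 'cat', 'sat']
```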

However, with prefix codes, choosing a codeword for an event means sacrificing the rest of the space of codes starting with that codeword. Thus, choosing a shorter codeword for one event forces us to choose longer codes for others, thereby potentially increasing the average code length. This is our cost in terms of the space of possible codewords when assigning codes. If $L$ is the length of a codeword in bits, we sacrifice $\frac{1}{2^L}$ of the space of codewords for this event.

For example, 3 bits can represent 8 numbers or partitions, and by reserving one of those codes, we effectively block $\frac{1}{8}$ of the space for this event, since every other event is now restricted to using the other 7 partitions. So that's our cost in terms of the space of possible codes.

We are interested in finding the optimal encoding for this set of events: one where the average message length is shortest.

It turns out that this optimal encoding occurs when we allot our budget to each word in proportion to its frequency distribution, i.e., more frequent words get shorter codes. In other words, for each $x$, we bear a cost exactly equal to its probability $p(x)$:

$$\frac{1}{2^{L(x)}} = p(x)$$

which gives us the optimal code length $L(x) = \log_2 \frac{1}{p(x)}$ for each word.
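As a quick sanity check of the $L(x) = \log_2 \frac{1}{p(x)}$ rule, here is a small sketch (the word distribution is made up, with probabilities chosen as powers of $\frac{1}{2}$) computing the optimal code lengths:

```python
import math

# A made-up word distribution; probabilities are powers of 1/2 so the
# optimal code lengths come out as whole numbers of bits.
p = {"the": 0.5, "cat": 0.25, "sat": 0.125, "mat": 0.125}

for word, prob in p.items():
    length = math.log2(1 / prob)        # optimal code length in bits
    print(f"{word}: p={prob}, optimal length={length:.0f} bits")
# the: 1 bit, cat: 2 bits, sat: 3 bits, mat: 3 bits -- matching the
# hand-picked prefix code in the earlier decoding sketch.
```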

Entropy

The entropy of a distribution $p$, written $H(p)$, is the expected length of a message in the optimal encoding for $p$:

$$H(p) = \sum_x p(x)\, L(x) = \sum_x p(x) \log_2 \frac{1}{p(x)} = -\sum_x p(x) \log_2 p(x)$$

Put directly, entropy is the least number of bits you need, on average, to communicate events from $p$. It gives us a way of quantifying how much information is contained in a distribution.

Which brings us to uncertainty. How much information does an absolutely certain event convey? None! We knew it was going to happen anyway. How much information does a uniform distribution over a bunch of numbers convey? A lot! We can't predict one event over another since they all have equal probabilities of occurring.

The more diffuse the probability distribution, the more uncertain we are about any of the values, the more we learn on average when we find out what happened, and the longer our messages have to be on average.
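Here's a minimal sketch (same made-up distribution as above) computing the entropy as the expected optimal code length:

```python
import math

p = {"the": 0.5, "cat": 0.25, "sat": 0.125, "mat": 0.125}

# Entropy: expected number of bits under the optimal encoding.
H = sum(prob * math.log2(1 / prob) for prob in p.values())
print(H)    # 1.75 bits, versus 2 bits for a fixed-length code over 4 words
```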

Perplexity

Perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution.

The perplexity of a discrete probability distribution $p$ is defined as

$$\text{Perplexity}(p) = 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}$$
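Continuing the sketch, perplexity is just $2^{H(p)}$; a uniform distribution over $k$ outcomes has perplexity exactly $k$, which is why perplexity is often read as an "effective number of choices":

```python
import math

def entropy(p):
    return sum(q * math.log2(1 / q) for q in p if q > 0)

def perplexity(p):
    return 2 ** entropy(p)

print(perplexity([0.25, 0.25, 0.25, 0.25]))   # 4.0: as hard as a fair 4-way guess
print(perplexity([0.7, 0.1, 0.1, 0.1]))       # ~2.56: easier to guess
print(perplexity([1.0, 0.0, 0.0, 0.0]))       # 1.0: no uncertainty at all
```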

Cross Entropy

Now, suppose we use the optimal encoding for one distribution $q$ to encode events actually occurring according to a different distribution $p$. We define the cross-entropy of $p$ w.r.t. $q$ as the average length of messages from $p$ using the optimal encoding scheme of $q$:

$$H(p, q) = \sum_x p(x) \log_2 \frac{1}{q(x)} = -\sum_x p(x) \log_2 q(x)$$

Note that it is not symmetric: in general, $H(p, q) \neq H(q, p)$.

Cross-entropy gives us a way to think about how different a distribution $q$ is from $p$. $H(p, q)$ is longer than $H(p)$ because the values drawn from $p$ occur with different frequencies than $q$'s encoding expects. Had the two distributions been the same, they would've had the same encoding and thus the cross-entropy would've been equal to the entropy of either of them.

The more different these distributions are, the longer the cross-entropy will be compared to the entropy, i.e., the larger $H(p, q) - H(p)$ will be.
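A small numerical sketch (made-up distributions) of the cost of using the wrong code: the cross-entropy $H(p, q)$ is never smaller than the entropy $H(p)$, and it grows as $q$ drifts away from $p$:

```python
import math

def entropy(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Average bits to send events from p using the optimal code for q.
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.7, 0.1, 0.1, 0.1]          # a mismatched encoding

print(entropy(p))                 # 1.75 bits
print(cross_entropy(p, q))        # ~1.92 bits: the mismatch costs ~0.17 bits/event
print(cross_entropy(q, p))        # 1.50 bits: note H(p, q) != H(q, p)
```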

Kullback–Leibler Divergence

This difference $H(p, q) - H(p)$ therefore gives us a neat way to compare two distributions, and is called the KL divergence of $p$ w.r.t. $q$, written $D_{KL}(p \,\|\, q)$. It is the extra number of bits we need to use to encode events from $p$ in the optimal encoding of $q$:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$

which, if you do the math, is

$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$$

which is also written as the expected value of the log likelihood ratio:

$$D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log_2 \frac{p(x)}{q(x)}\right]$$

KL divergence is again not symmetric: in general $D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p)$. It's not a distance metric (distance metrics are symmetric); it's a divergence measure.
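Continuing the sketch above, the KL divergence is exactly the gap between the cross-entropy and the entropy, and it is not symmetric:

```python
import math

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x))
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.7, 0.1, 0.1, 0.1]

print(kl(p, q))   # ~0.168 bits = H(p, q) - H(p) = 1.918 - 1.750
print(kl(q, p))   # ~0.143 bits = H(q, p) - H(q) = 1.500 - 1.357: not symmetric
```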

A note on distance metrics vs. divergences.

While both distance metrics and divergences indicate how far apart two entities are, they are a bit different. Distance metrics² are:

  • symmetrical,
  • and satisfy the triangle inequality.

Divergences are defined specifically on probability distributions. While all distance metrics between probability distributions are divergences, the converse is not true.

Taking a look at $D_{KL}(p \,\|\, q) = H(p, q) - H(p)$, we see that the cross-entropy and the KL divergence differ by the entropy term $H(p)$. If $p$ is the true data distribution³ and $q_\theta$ is the model distribution that we are trying to match to $p$, then since the entropy of the true data distribution is independent of the model parameters $\theta$, minimizing the cross-entropy $H(p, q_\theta)$ is equivalent to minimizing the KL divergence $D_{KL}(p \,\|\, q_\theta)$. This is what we often do in classification problems.⁴
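In symbols, with $p$ fixed and only $q_\theta$ depending on the parameters, the constant $H(p)$ drops out of the optimization:

$$\arg\min_\theta D_{KL}(p \,\|\, q_\theta) = \arg\min_\theta \left[ H(p, q_\theta) - H(p) \right] = \arg\min_\theta H(p, q_\theta)$$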

Suppose we want to approximate a distribution $p$ using a simpler distribution $q$. We can do this by minimizing $D_{KL}(p \,\|\, q)$ or $D_{KL}(q \,\|\, p)$. The resulting distribution $q$ will have different characteristics based on which KL we choose:

Forward KL, $D_{KL}(p \,\|\, q)$, or Inclusive KL

  • M-projection or moment projection
  • Zero-avoiding or mode-covering
  • Minimizing the forward KL will force $q$ to include all the areas of the space for which $p$ has non-zero probability, and will typically over-estimate the support of $p$.

Reverse KL, $D_{KL}(q \,\|\, p)$, or Exclusive KL

  • I-projection or information projection
  • Zero-forcing or mode-seeking
  • Minimizing the exclusive KL will force $q$ to exclude all the areas of the space for which $p$ has zero probability, and will typically under-estimate the support of $p$ (see the numerical sketch after this list).
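As a concrete illustration, here is a minimal numerical sketch (all details made up: a discretized bimodal target $p$ and a single-Gaussian family for $q$, fit by brute-force grid search) showing the mode-covering vs. mode-seeking behavior:

```python
import numpy as np

xs = np.linspace(-8, 8, 401)

def normalize(w):
    return w / w.sum()

def gaussian(mu, sigma):
    return normalize(np.exp(-0.5 * ((xs - mu) / sigma) ** 2))

# Bimodal target p: two well-separated modes at -3 and +3.
p = normalize(gaussian(-3.0, 0.6) + gaussian(3.0, 0.6))

def kl(a, b, eps=1e-12):
    return np.sum(a * np.log((a + eps) / (b + eps)))

# Brute-force search over single Gaussians q = N(mu, sigma).
candidates = [(mu, sigma) for mu in np.linspace(-5, 5, 101)
                          for sigma in np.linspace(0.3, 5, 48)]

fwd = min(candidates, key=lambda c: kl(p, gaussian(*c)))   # forward: D_KL(p || q)
rev = min(candidates, key=lambda c: kl(gaussian(*c), p))   # reverse: D_KL(q || p)

print("forward KL picks mu=%.1f sigma=%.1f" % fwd)   # mu near 0, wide sigma: covers both modes
print("reverse KL picks mu=%.1f sigma=%.1f" % rev)   # mu near +/-3, small sigma: locks onto one mode
```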

Maximum Likelihood Estimation

Defining Likelihood

In probability theory, the likelihood $L(\theta \mid x)$ of a parameter $\theta$ given some fixed data $x$ is any function of $\theta$ that is proportional to the sampling density $p(x \mid \theta)$ (the density of $x$), i.e.,

$$L(\theta \mid x) \propto p(x \mid \theta)$$

In Bayesian statistics, we usually express Bayes' theorem in its simplest form as:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$

also expressed as:

$$\text{posterior} \propto \text{likelihood} \times \text{prior}$$

Strictly speaking, we do not require that the likelihood be equal to the sampling density; it need only be proportional to it, which allows us to drop multiplicative factors that do not depend on the parameters.
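For a concrete (made-up) example: if the data $x$ is "7 heads out of 10 flips" and $\theta$ is the probability of heads, the sampling density is binomial, $p(x \mid \theta) = \binom{10}{7}\theta^7(1-\theta)^3$, and any function proportional to it in $\theta$, such as $L(\theta \mid x) = \theta^7(1-\theta)^3$ with the constant $\binom{10}{7}$ dropped, is a perfectly good likelihood.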

We know that minimizing the KL divergence is a way to bring two distributions closer. Let $\mathbb{X} = \{x_1, \ldots, x_N\}$ be some data that we assume is distributed according to a true data distribution $p_{\text{data}}$. Let $p_{\text{model}}$ be some parametric distribution with parameters $\theta$ that we are trying to fit to the data we have (since $p_{\text{model}}$ is a function of both $x$ and $\theta$, we can also write the density as $p_{\text{model}}(x; \theta)$; this just means it is a function of $x$ parameterized by $\theta$). We can do this by choosing $\theta$ in a way that maximizes the likelihood of that data, i.e.,

$$\theta_{ML} = \arg\max_\theta \; p_{\text{model}}(\mathbb{X}; \theta)$$

Since we assume the training data is i.i.d., the expression becomes

$$\theta_{ML} = \arg\max_\theta \; \prod_{i=1}^{N} p_{\text{model}}(x_i; \theta)$$

Taking the $\log$ allows us to change the product into a summation, and since $\log$ is an increasing function, it does not affect the $\arg\max$:

$$\theta_{ML} = \arg\max_\theta \; \sum_{i=1}^{N} \log p_{\text{model}}(x_i; \theta)$$
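A minimal sketch (made-up coin-flip data) checking that the $\theta$ maximizing the log likelihood is the intuitive answer, the observed fraction of heads:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=1000)          # 1000 coin flips with true theta = 0.7

thetas = np.linspace(0.01, 0.99, 99)
# Log likelihood of the i.i.d. Bernoulli data for each candidate theta.
loglik = np.array([np.sum(data * np.log(t) + (1 - data) * np.log(1 - t))
                   for t in thetas])

print(thetas[np.argmax(loglik)])    # the grid point nearest the sample mean, ~0.7
print(data.mean())                  # the exact MLE: the observed fraction of heads
```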

Conditional MLE

While here we considered the unconditional distribution of data (such as a normal distribution fit to the heights of sampled males in a country), MLE can just as easily be applied to conditional distributions; this is actually the most common situation, because it forms the basis for most supervised learning.

If $\mathbb{X}$ represents all our inputs and $\mathbb{Y}$ all our observed targets, then the conditional maximum likelihood estimator is:

$$\theta_{ML} = \arg\max_\theta \; p_{\text{model}}(\mathbb{Y} \mid \mathbb{X}; \theta)$$

which in the i.i.d. case becomes:

$$\theta_{ML} = \arg\max_\theta \; \sum_{i=1}^{N} \log p_{\text{model}}(y_i \mid x_i; \theta)$$
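As a small sketch of conditional MLE (all details made up), here is a one-parameter logistic regression fit by gradient ascent on the conditional log likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
true_w = 2.0
y = rng.binomial(1, 1 / (1 + np.exp(-true_w * x)))   # targets drawn from p(y|x; true_w)

def cond_loglik(w):
    p = 1 / (1 + np.exp(-w * x))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Gradient ascent on the conditional log likelihood.
w = 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-w * x))
    grad = np.sum((y - p) * x)          # d/dw of the conditional log likelihood
    w += 0.01 * grad

print(w, cond_loglik(w))                # w recovered close to 2.0
```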

KL Divergence and MLE

Because the $\arg\max$ does not change when we rescale the objective, we can divide the (unconditional) MLE expression by $N$ to get the sample mean (an expectation w.r.t. the empirical distribution $\hat{p}_{\text{data}}$) of the log likelihood:

$$\theta_{ML} = \arg\max_\theta \; \frac{1}{N} \sum_{i=1}^{N} \log p_{\text{model}}(x_i; \theta) = \arg\max_\theta \; \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$$

Then, by the law of large numbers, this sample mean converges to the true expectation as $N \to \infty$, so asymptotically:

$$\theta_{ML} = \arg\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$$

We can also estimate $p_{\text{data}}$ with $p_{\text{model}}$ by choosing $\theta$ in a way that minimizes the KL divergence $D_{KL}(p_{\text{data}} \,\|\, p_{\text{model}})$:

$$\theta^{*} = \arg\min_\theta \; D_{KL}(p_{\text{data}} \,\|\, p_{\text{model}})$$

Using the definition of $D_{KL}$:

$$D_{KL}(p_{\text{data}} \,\|\, p_{\text{model}}) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log \frac{p_{\text{data}}(x)}{p_{\text{model}}(x; \theta)} \right]$$

Simplifying,

$$D_{KL}(p_{\text{data}} \,\|\, p_{\text{model}}) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{data}}(x) \right] - \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$$

Plugging this back into the minimization and noting that the first term is a constant w.r.t. $\theta$:

$$\theta^{*} = \arg\min_\theta \; \left( -\mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right] \right)$$

which, simplifying further (multiplying by $-1$ to take the minus sign out and changing the $\arg\min$ to an $\arg\max$), gives us

$$\theta^{*} = \arg\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$$

which is exactly the asymptotic maximum likelihood objective we derived above, thus proving that maximizing likelihood is asymptotically equivalent to minimizing the KL divergence.

Also note that the non-constant quantity $-\mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x; \theta) \right]$ is the cross-entropy $H(p_{\text{data}}, p_{\text{model}})$. Thus, maximizing likelihood is also equivalent to minimizing the cross-entropy.
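A quick numerical check of the equivalence (reusing the made-up coin-flip setup: $\hat{p}_{\text{data}}$ is the empirical distribution of the flips over $\{0, 1\}$, and $p_{\text{model}}$ is Bernoulli($\theta$)):

```python
import numpy as np

data = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])      # made-up flips: 7 heads, 3 tails
p_hat = np.array([1 - data.mean(), data.mean()])      # empirical distribution over {0, 1}

thetas = np.linspace(0.01, 0.99, 99)

def avg_loglik(t):
    return np.mean(data * np.log(t) + (1 - data) * np.log(1 - t))

def kl_to_model(t):
    q = np.array([1 - t, t])                           # Bernoulli(theta) model
    return np.sum(p_hat * np.log(p_hat / q))

print(thetas[np.argmax([avg_loglik(t) for t in thetas])])   # both pick theta = 0.7:
print(thetas[np.argmin([kl_to_model(t) for t in thetas])])  # same argmax / argmin
```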

Negative Log Likelihood Loss (NLL) a.k.a. Log Loss

The negative of the log likelihood, so that instead of maximizing the likelihood, we can estimate $\theta$ by minimizing its negative like a typical loss function:

$$\text{NLL}(\theta) = -\sum_{i=1}^{N} \log p_{\text{model}}(y_i \mid x_i; \theta)$$
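A minimal sketch of the NLL loss for a classifier (made-up per-example class probabilities and targets): pick out the log probability assigned to each true class and average the negatives:

```python
import numpy as np

# Made-up per-example class log-probabilities (rows) and true class indices.
log_probs = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.3, 0.3, 0.4]]))
targets = np.array([0, 1, 2])

nll = -np.mean(log_probs[np.arange(len(targets)), targets])
print(nll)          # average of -log(0.7), -log(0.8), -log(0.4)
```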

Cross Entropy Loss

Softmax + NLL loss: the model produces unnormalized scores (logits), the softmax turns them into a probability distribution over classes, and the loss is the negative log likelihood of the target class under that distribution.
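Here's a sketch of the cross-entropy loss as softmax followed by NLL, written out in plain numpy with made-up logits (in PyTorch, for example, `torch.nn.functional.cross_entropy` combines a log-softmax with an NLL loss in the same way):

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy_loss(logits, targets):
    # Softmax turns logits into class probabilities; NLL picks out the target ones.
    lp = log_softmax(logits)
    return -np.mean(lp[np.arange(len(targets)), targets])

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])     # made-up unnormalized scores
targets = np.array([0, 1])
print(cross_entropy_loss(logits, targets))
```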

Bibliography

Footnotes

  1. This section is a condensed version of the excellent blog: Visual Information Theory -- Colah’s Blog (2015)

  2. For formal definition of distances, see Metric space - Wikipedia

  3. Technically, while the generating process may generate data with a true data distribution of , the sample distribution may differ slightly, and is called the empirical distribution. Since we don’t have access to the true distribution, this sample distribution is what we try to estimate.

  4. See also on why cross entropy is a better idea sometimes: What is the difference between Cross-entropy and KL divergence?

Visual Information Theory -- colah’s blog. (2015, October 14). https://colah.github.io/posts/2015-09-Visual-Information/