These are some of the notes I wrote while taking an information theory course. They contain basic definitions and theorems, don't go into much detail, and are meant as an intuitive cheat-sheet. The content aligns with the first few chapters of the book “Elements of Information Theory” by Cover and Thomas, so you can find the proofs there (it is a very well-written and easy-to-read book, I definitely recommend it!).
Entropy
Let $X$ be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x)$.
The entropy $H(X)$ of a discrete random variable $X$ is defined by $$ H(X) = -\mathbb{E}_{X\sim p(x)}\left[\log p(X)\right] = - \sum_{x\in\mathcal{X}}p(x)\log p(x). $$
- Intuitively, the entropy $H(X)$ is a measure of how informative the observation of an instantiation of $X$ is. Alternatively, we can think of it as how uncertain we are about the value of $X$.
- If $X$ is not random at all, we already know what an instantiation is going to be; therefore, the entropy is zero. On the other hand, the more uniformly distributed $p(x)$ is, the more uncertain we are about the value of an instantiation of $X$ (a quick numerical check appears after this list).
- When calculating the entropy, we don’t consider the values of $x\in \mathcal{X}$ for which $p(x) = 0$; conventionally, $0\log 0 = 0\log\infty = 0$ and $p\log\infty = \infty$ for $p > 0$ in the calculations.
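As a quick sanity check of the two extreme cases above, here is a minimal numerical sketch (Python/NumPy; the helper name `entropy` and the choice of base-2 logarithms, so that results are in bits, are mine):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # skip symbols with p(x) = 0
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0, 0.0, 0.0]))        # deterministic: 0.0 bits
print(entropy([0.25] * 4))                  # uniform over 4 symbols: 2.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))   # in between: 1.75 bits
```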
Joint and Conditional Entropy
The joint entropy $H(X, Y)$ of a pair of random variables is just the entropy of their joint distribution:
$$ H(X, Y) = -\mathbb{E}_{(X, Y)\sim p(x, y)}\left[\log p(X, Y)\right] = - \sum_{x\in\mathcal{X}, y\in\mathcal{Y}}p(x, y)\log p(x, y). $$
The conditional entropy of $Y$ given $X$ is defined as the expected entropy of $Y$ given the value of $X$:
$$ H(Y|X) = -\mathbb{E}_{(X, Y)\sim p(x, y)}\left[\log p(Y|X)\right] = -\sum_{x\in\mathcal{X}, y\in \mathcal{Y}}p(x, y)\log p(y|x) = \sum_{x\in \mathcal{X}}p(x)H(Y|X=x). $$
- In general, $H(Y|X) \neq H(X|Y)$. For example, consider $X \sim \mathrm{Uniform}(\{1, \cdots, 100\})$ and $Y = X \bmod 10$: here $H(Y|X) = 0$ since $Y$ is determined by $X$, whereas $H(X|Y) = \log 10 > 0$.
Because entropy is defined as an expectation of a log-probability, many identities for probability distributions translate naturally into identities for entropy. For instance, we have the chain rule:
$$ H(X, Y) = H(X) + H(Y|X) $$
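As a small numerical check of the chain rule, here is a sketch on an arbitrary $2\times 3$ joint pmf (Python/NumPy; the table and the helper name `H` are my own illustrative choices):

```python
import numpy as np

def H(p):
    """Entropy (in bits) of a pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# An arbitrary joint pmf p(x, y) on a 2x3 alphabet.
p_xy = np.array([[0.10, 0.20, 0.30],
                 [0.25, 0.05, 0.10]])

p_x = p_xy.sum(axis=1)                          # marginal p(x)
H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i])  # sum_x p(x) H(Y|X=x)
                  for i in range(len(p_x)))

print(H(p_xy))                 # H(X, Y)
print(H(p_x) + H_Y_given_X)    # H(X) + H(Y|X): the same value
```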
KL-divergence and Mutual Information
The Kullback-Leibler divergence or relative entropy between two probability mass functions $p(x)$ and $q(x)$ is defined as
$$ D(p||q) = \mathbb{E}_{X\sim p(x)}\left[\log\frac{p(X)}{q(X)}\right] = \sum_{x\in \mathcal{X}} p(x)\log\frac{p(x)}{q(x)}. $$
- In calculations, let $0\log {(\mathrm{whatever})} = 0$ and $p\log\frac{p}{0} = \infty$ for $p>0$.
- You can think of KL-divergence as some sort of distance between $p$ and $q$. However, it is not a metric, as it is not symmetric and does not satisfy the triangle inequality.
- Another way of thinking about it is as a measure of how easily samples from $p$ can be distinguished from samples from $q$.
- If there is a symbol $x\in \mathcal{X}$ that may be seen when we sample from $p$, but will never appear when sampling from $q$ (i.e., $p(x)>0 = q(x)$), then the distance $D(p||q)$ is $\infty$. This is because there is a chance that, when sampling from $p$, we observe $x$ and immediately recognize that we are not sampling from $q$. Note, however, that $D(q||p)$ may not be $\infty$, as not observing $x$ cannot assure us that we are not sampling from $p$.
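The asymmetry and the $\infty$ case above are easy to see numerically; here is a minimal sketch (Python/NumPy; the `kl` helper and the example distributions are arbitrary choices of mine):

```python
import numpy as np

def kl(p, q):
    """D(p||q) in bits, with 0 log(0/q) = 0 and p log(p/0) = infinity for p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q), kl(q, p))       # ~0.74 vs ~0.53: not symmetric

r  = [0.5, 0.25, 0.25]          # puts mass on a third symbol
p3 = [0.5, 0.50, 0.00]          # never produces that symbol
print(kl(r, p3))                # inf: r can emit a symbol p3 never does
print(kl(p3, r))                # finite (0.5 bits)
```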
The mutual information $I(X; Y)$ between two random variables is the KL-divergence between their joint distribution, $p(x, y)$ and the product of their marginals, $p(x)p(y)$:
$$ I(X; Y) = D(p(x, y)||p(x)p(y)) = \sum_{x\in\mathcal{X}, y\in \mathcal{Y}} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}. $$
- Mutual information is symmetric, i.e., $I(X;Y) = I(Y;X).$
- You can think of mutual information as the amount of shared information between two random variables. If $X$ and $Y$ are independent, their mutual information is zero. Otherwise, their mutual information is strictly positive.
- If $Y = f(X)$ for a one-to-one mapping $f$, then $X$ and $Y$ contain the same information; hence, their mutual information is equal to (each of) their entropies.
- In particular, the amount of shared information between $X$ and itself is precisely the entropy of $X$:
$$ I(X; X) = H(X) $$
- The mutual information can also be thought of as the reduction in the uncertainty of one variable due to the knowledge of the other, i.e.,
$$ I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y). $$
- We can also write the mutual information as the sum of the information in each variable, minus the total information contained (jointly) in $X$ and $Y$:
$$ I(X; Y) = H(X) + H(Y) - H(X, Y) $$
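These identities can be checked numerically on a small joint pmf; the following is a sketch with an arbitrary $2\times 2$ table and my own helper name:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# An arbitrary joint pmf p(x, y) on a 2x2 alphabet.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# 1) KL-divergence between the joint and the product of the marginals.
I1 = sum(p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2))
# 2) H(X) + H(Y) - H(X, Y).
I2 = H(p_x) + H(p_y) - H(p_xy)

print(I1, I2)   # the two expressions agree
```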
Conditional Divergence and Mutual Information
We can define conditional KL-divergence and conditional mutual information the same way we did for entropy.
The conditional mutual information between $X$ and $Y$ given $Z$ is the reduction in the uncertainty of $X$ due to the knowledge of $Y$, given that $Z$ is known:
$$ I(X;Y|Z) = H(X|Z) - H(X|Y, Z) \\ = \mathbb{E}_{X, Y, Z \sim p(x, y, z)} \left[\log\frac{p(X, Y|Z)}{p(X|Z)p(Y|Z)}\right] $$
The conditional KL-divergence between the conditional distributions $p(y|x)$ and $q(y|x)$ is defined to be
$$ D(p(y|x)||q(y|x)) = \mathbb{E}_{X, Y\sim p(x, y)}\left[\log\frac{p(Y|X)}{q(Y|X)}\right]. $$
(Notice how the expectation is taken with respect to the joint distribution $p(x, y)$.)
We also have a chain rule for mutual information:
$$ I(X_1,\cdots,X_n; Y) = \sum_{i=1}^{n}I(X_i;Y|X_1,\cdots, X_{i-1}), $$
and for the KL-divergence:
$$ D(p(x_1, \cdots, x_n)||q(x_1, \cdots, x_n)) = \sum_{i=1}^{n} D(p(x_i|x_1,\cdots, x_{i-1})||q(x_i|x_1, \cdots,x_{i-1})). $$
Remark: To prove any of the chain rules, just
- write the quantity in question (be it entropy, divergence, or mutual information) as the expected value of a function of the joint distributions,
- decompose the joint distributions inside the expectation by conditioning on the variables one at a time,
- use the fact that the logarithm of a product is the sum of the logarithms, and
- use the linearity of expectation.
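As a sanity check of the chain rule for the KL-divergence (in the two-variable case), here is a minimal numerical sketch; the joint pmfs and helper names below are arbitrary choices of mine:

```python
import numpy as np

def kl(p, q):
    """D(p||q) in bits (assumes q > 0 wherever p > 0)."""
    p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

# Two arbitrary joint pmfs p(x, y) and q(x, y) on a 2x2 alphabet.
p = np.array([[0.3, 0.2], [0.1, 0.4]])
q = np.array([[0.25, 0.25], [0.25, 0.25]])
p_x, q_x = p.sum(axis=1), q.sum(axis=1)

# Conditional divergence D(p(y|x) || q(y|x)), averaged over p(x).
cond = sum(p_x[i] * kl(p[i] / p_x[i], q[i] / q_x[i]) for i in range(2))

print(kl(p, q))             # D(p(x, y) || q(x, y))
print(kl(p_x, q_x) + cond)  # D(p(x) || q(x)) + D(p(y|x) || q(y|x)): equal
```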
Inequalities and Bounds
There are two main inequalities used to provide bounds in the context of entropies: Jensen’s inequality and the log-sum inequality. Before introducing them, some definitions:
A function $f: (a, b) \to \mathbb{R}$ is convex if for every $x_1, x_2 \in (a, b)$ and $\lambda \in [0, 1]$,
$$ f(\lambda x_1 +(1-\lambda) x_2) \leqslant \lambda f(x_1) + (1-\lambda)f(x_2). $$
If $f$ is twice differentiable, this is equivalent to $f''(x)\geqslant 0$ for all $x\in (a, b)$.
Intuitively, $f$ is convex if it lies below the line segment connecting any two points on its graph.
A function $f$ is concave if $-f$ is convex. This means that the inequality in the definition of convexity changes its direction, the second derivative is non-positive, and $f$ lies above the line segment connecting any two points on its graph.
Jensen’s Inequality: If $f$ is a convex function and $X$ is a random variable, then $\mathbb E [f(X)] \geqslant f(\mathbb E [X])$. If $f$ is concave, the direction of the inequality is reversed; we particularly use this concave case with $f$ being the logarithm.
Log-Sum Inequality: For non-negative numbers $a_1, a_2, \cdots, a_n$ and $b_1, b_2, \cdots, b_n$,
$$ \sum_{i=1}^{n} {a_i\log\frac{a_i}{b_i}} \geqslant (\sum_{i=1}^n a_i) \log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i} $$
with equality if and only if $\frac{a_i}{b_i} = \mathrm{constant}$.
This inequality follows from applying Jensen’s inequality to the convex function $f(t) = t\log t$.
Notice how the left hand side is similar to the definition of KL-divergence!
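A tiny numerical illustration of the log-sum inequality (the numbers below are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 1.0])

lhs = np.sum(a * np.log2(a / b))             # sum_i a_i log(a_i / b_i)
rhs = a.sum() * np.log2(a.sum() / b.sum())   # (sum a_i) log(sum a_i / sum b_i)
print(lhs, rhs, lhs >= rhs)                  # LHS >= RHS; equal only if a_i/b_i is constant
```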
Now, let’s see some consequences of the application of these inequalities:
- KL-divergence is non-negative: $D(p||q)\geqslant 0$, with equality only when $p=q$.
- Mutual information and conditional mutual information are both non-negative: $I(X; Y|Z)\geqslant 0$, with equality only when $X$ and $Y$ are conditionally independent given $Z$.
- (Uniform distribution gives maximum entropy) $H(X) \leqslant \log |\mathcal{X}|$, where $\mathcal{X}$ is the support of $X$, with equality if and only if $X$ is uniformly distributed over $\mathcal{X}$.
- (Conditioning reduces entropy) Intuitively, having information about $Y$ can only decrease the uncertainty in $X$ on average: $H(X|Y) \leqslant H(X)$. (Note that $H(X|Y=y)$ may well exceed $H(X)$ for a particular value $y$; the inequality holds in expectation over $Y$.)
- (Independence bound on entropy) The joint entropy of $n$ variables cannot be larger than the sum of their individual entropies:
$$ H(X_1, \cdots, X_n) \leqslant \sum_{i=1}^n H(X_i) $$
with equality if and only if $X_i$’s are independent.
- (Convexity of KL-divergence) $D(p||q)$ is convex in the pair $(p, q)$; that is, for any two pairs of distributions $(p_1, q_1)$ and $(p_2, q_2)$ and any $\lambda \in [0, 1]$,
$$ D(\lambda p_1 +(1-\lambda)p_2 || \lambda q_1 + (1-\lambda)q_2) \leqslant \lambda D(p_1||q_1) + (1-\lambda) D(p_2||q_2). $$
- (Concavity of entropy) $H(p)$ is a concave function of $p$; thus, the entropy of a mixture of distributions is at least as large as the corresponding mixture of their entropies. This result follows easily from the convexity of KL-divergence: let $u$ be the uniform distribution over $\mathcal{X}$; then, for any distribution $p$ over $\mathcal{X}$,
$$ H(p) = H(u) - D(p||u) = \log {|\mathcal{X}|} - D(p||u), $$
so $H(p)$ is a constant minus a convex function of $p$ and is therefore concave.
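The identity $H(p) = \log|\mathcal{X}| - D(p||u)$, and with it the maximum-entropy bound, can be verified numerically; a minimal sketch (the helper names and the distribution are mine):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

p = np.array([0.5, 0.2, 0.2, 0.1])
u = np.full(4, 0.25)                  # uniform over an alphabet of size 4

print(H(p))                           # entropy of p
print(np.log2(4) - kl(p, u))          # log|X| - D(p||u): the same value
print(H(p) <= np.log2(4))             # uniform distribution maximizes entropy: True
```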
Data-Processing Inequality
Let $X\rightarrow Y \rightarrow Z$ be a Markov chain, that is, $Z$ is independent of $X$ given $Y$: $p(x, y, z) = p(x)p(y|x)p(z|y)$. Then,
$$ I(X;Y) \geqslant I(X;Z). $$
- The data-processing inequality shows that no clever manipulation of the data can improve the inferences that can be made from that data.
- In particular, for any function $g$, $X \rightarrow Y \rightarrow g(Y)$ is a Markov chain; thus
$$ I(X; Y) \geqslant I(X; g(Y)), $$
which yields the previous claim about data manipulation.
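A small numerical illustration of the data-processing inequality, using a noisy channel $X\to Y$ and the deterministic post-processing $Z = Y \bmod 2$ (all the choices below are mine):

```python
import numpy as np

def mi(p_xy):
    """I(X;Y) in bits, computed from a joint pmf table."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz]))

# X uniform on {0,1,2,3}; Y = X passed through a noisy channel
# (correct with prob 0.85, each wrong symbol with prob 0.05).
p_xy = np.array([[0.25 * (0.85 if y == x else 0.05) for y in range(4)]
                 for x in range(4)])

# Z = g(Y) = Y mod 2: merge the columns of equal parity.
p_xz = np.stack([p_xy[:, 0::2].sum(axis=1), p_xy[:, 1::2].sum(axis=1)], axis=1)

print(mi(p_xy), mi(p_xz))   # I(X;Y) >= I(X; g(Y))
```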
Fano’s Inequality
Suppose we know a random variable $Y$ and we want to guess the value of a correlated random variable $X$. Intuitively, knowing $Y$ should allow us to better guess the value of $X$. Fano’s inequality relates the probability of error in guessing $X$ to the conditional entropy $H(X|Y)$.
Suppose we use $\hat{X} = g(Y)$ to estimate $X$ after observing $Y$ ($g$ is a possibly random function). One can see that $X \rightarrow Y \rightarrow \hat X$ forms a Markov chain. We would like to bound the probability of error in estimating $X$ using $\hat X$, defined as
$$ P_e = \Pr[\hat X \neq X]. $$
(Fano’s inequality) For any estimator $\hat X$ such that $X\rightarrow Y \rightarrow \hat X$ forms a Markov chain, we have
$$ H(P_e) + P_e \log {|\mathcal{X}|} \geqslant H(X|\hat X) \geqslant H(X|Y). $$
This can be weakened to
$$ 1 + P_e \log {|\mathcal X|} \geqslant H(X|Y) $$
or
$$ P_e \geqslant \frac {H(X|Y) - 1}{\log {|\mathcal{X}|}}. $$
So, the larger $H(X|Y)$ is, the more likely it is for our estimator to make an error.
Notice that $I(X; Y) = H(X) - H(X|Y)$. So, for a fixed entropy of $X$, a large mutual information between $X$ and $Y$ indicates that we can better estimate $X$ given $Y$.
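To close, here is a small numerical check of the weakened Fano bound for a noisy observation of a uniform $X$, with the guess $\hat X = g(Y)$ taken to be the most likely $X$ given $Y$; the channel and helper names below are arbitrary choices of mine:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# X uniform on {0,1,2,3}; Y is a noisy copy of X: correct with prob 0.55,
# otherwise each of the three wrong symbols with prob 0.15.
p_xy = np.array([[0.25 * (0.55 if y == x else 0.15) for y in range(4)]
                 for x in range(4)])

p_y = p_xy.sum(axis=0)
H_X_given_Y = sum(p_y[j] * H(p_xy[:, j] / p_y[j]) for j in range(4))

# Best deterministic guess from Y: pick the most likely x in each column.
P_e = 1.0 - sum(p_xy[:, j].max() for j in range(4))

print(P_e)                              # probability of error: 0.45
print((H_X_given_Y - 1) / np.log2(4))   # Fano lower bound: ~0.35 <= P_e
```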