# Information Theory
## Entropy
Shannon entropy measures expected surprise (bits of information):
$$H(X) = -\sum_x p(x) \log_2 p(x) = \mathbb{E}[-\log p(X)]$$
- Maximum when $p$ is uniform: $H = \log_2 n$ for $n$ outcomes
- Zero when outcome is certain
Differential entropy for continuous $X$: $h(X) = -\int p(x) \log p(x) \, dx$
## KL Divergence
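The discrete definition above is easy to check numerically. A small sketch using NumPy (the helper name `entropy` is ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H(X) = -sum p log2 p, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes (limit p log p -> 0)
    return float(-np.sum(p * np.log2(p)))

# Uniform over 4 outcomes: maximum entropy, H = log2(4)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# Certain outcome: zero entropy
print(entropy([1.0, 0.0, 0.0, 0.0]))
```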
Measures how a distribution $P$ diverges from a reference $Q$: the expected extra bits needed to code samples from $P$ using a code optimized for $Q$:
$$\text{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_P\!\left[\log \frac{P(X)}{Q(X)}\right]$$
Properties:
- $\text{KL}(P \| Q) \geq 0$ (Gibbs’ inequality)
- $\text{KL}(P \| Q) = 0$ iff $P = Q$
- Not symmetric: $\text{KL}(P \| Q) \neq \text{KL}(Q \| P)$
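The asymmetry is easy to see on a concrete pair of distributions. A sketch (the helper name `kl` is ours; it assumes $Q(x) > 0$ wherever $P(x) > 0$):

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
# KL(P||Q) and KL(Q||P) disagree: the divergence is not a metric
print(kl(p, q), kl(q, p))
```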
## Cross-Entropy
$$H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + \text{KL}(P \| Q)$$
Since $H(P)$ is fixed by the data, minimizing cross-entropy loss over the model $Q$ ≡ minimizing $\text{KL}(P \| Q)$ between the true distribution $P$ and the model $Q$.
Classification cross-entropy loss (one-hot $P$):
$$L = -\sum_k y_k \log \hat{y}_k = -\log \hat{y}_{\text{true}}$$
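Both the decomposition $H(P, Q) = H(P) + \text{KL}(P \| Q)$ and the one-hot special case can be verified numerically. A sketch (the helper name `cross_entropy` is ours):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum P log2 Q in bits, skipping outcomes with p = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p = np.array([0.7, 0.2, 0.1])      # "true" distribution
q = np.array([0.5, 0.3, 0.2])      # model distribution
h_p = cross_entropy(p, p)          # H(P) = H(P, P)
kl_pq = cross_entropy(p, q) - h_p  # decomposition: H(P,Q) - H(P) = KL(P||Q) >= 0

# One-hot labels reduce to -log2 of the probability assigned to the true class
y = np.array([0.0, 1.0, 0.0])
print(cross_entropy(y, q))  # equals -log2(0.3)
```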
## Mutual Information
$$I(X; Y) = \text{KL}(P(X,Y) \| P(X)P(Y)) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
Measures how much knowing $Y$ reduces uncertainty about $X$.
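The KL form above computes directly from a joint pmf table. A sketch (the helper name `mutual_information` is ours) contrasting perfectly correlated and independent variables:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) = KL(P(X,Y) || P(X)P(Y)) in bits, from a 2-D joint pmf table."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)  # marginal P(X), column vector
    py = pxy.sum(axis=0, keepdims=True)  # marginal P(Y), row vector
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

# Perfectly correlated bits: knowing Y determines X, so I = H(X) = 1 bit
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(mutual_information(joint))  # 1.0

# Independent variables: the joint factorizes, so I = 0
indep = np.outer([0.5, 0.5], [0.3, 0.7])
print(mutual_information(indep))  # ~0
```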
## Connection to ML
| ML concept | Information theory |
|---|---|
| Cross-entropy loss | $H(\text{labels}, \text{model})$ |
| Variational lower bound (ELBO) | $-\text{KL}$ + reconstruction |
| Rate-distortion | compression tradeoff |
| Mutual information maximization | representation learning |