# Information Theory

## Entropy

Shannon entropy measures expected surprise (bits of information):

$$H(X) = -\sum_x p(x) \log_2 p(x) = \mathbb{E}[-\log p(X)]$$

  • Maximum when $p$ is uniform: $H = \log_2 n$ for $n$ outcomes
  • Zero when outcome is certain

Differential entropy for continuous $X$: $h(X) = -\int p(x) \log p(x) \, dx$
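The discrete entropy formula above can be sketched in a few lines of NumPy (the `entropy` helper is my own, not from any library):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy -sum p log p; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log 0 is taken as 0
    return -np.sum(p * np.log(p)) / np.log(base)

# Uniform over 4 outcomes: maximum entropy, H = log2(4) = 2 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# Certain outcome: 0 bits
print(entropy([1.0, 0.0, 0.0]))
```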

## KL Divergence

Measures how distribution $P$ differs from a reference distribution $Q$: the expected extra bits needed to encode samples from $P$ using a code optimized for $Q$:

$$\text{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_P\!\left[\log \frac{P(X)}{Q(X)}\right]$$

Properties:

  • $\text{KL} \geq 0$ (Gibbs’ inequality)
  • $\text{KL}(P \| Q) = 0$ iff $P = Q$
  • Not symmetric: $\text{KL}(P \| Q) \neq \text{KL}(Q \| P)$
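The properties above, in particular the asymmetry, are easy to check numerically (a minimal sketch; the `kl` helper is my own naming):

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, q))  # > 0 (Gibbs' inequality)
print(kl(q, p))  # a different value: KL is not symmetric
print(kl(p, p))  # 0.0 iff the distributions match
```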

## Cross-Entropy

$$H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + \text{KL}(P \| Q)$$

Minimizing cross-entropy loss over the model $Q$ is equivalent to minimizing $\text{KL}(P \| Q)$, since $H(P)$ does not depend on $Q$.

Classification cross-entropy loss (one-hot $P$):

$$L = -\sum_k y_k \log \hat{y}_k = -\log \hat{y}_{\text{true}}$$
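A small sketch of this loss (names like `cross_entropy` are my own; bits are used throughout to match the entropy definition above, whereas ML frameworks typically use nats):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

y = [0.0, 1.0, 0.0]      # one-hot label: true class is index 1
y_hat = [0.2, 0.7, 0.1]  # model probabilities
# For one-hot P the sum collapses to -log q at the true class,
# and H(P) = 0, so the loss equals KL(P || Q) exactly.
print(cross_entropy(y, y_hat))  # = -log2(0.7)
```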

## Mutual Information

$$I(X; Y) = \text{KL}(P(X,Y) \| P(X)P(Y)) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

Measures how much knowing $Y$ reduces uncertainty about $X$.
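The KL form above can be computed directly from a joint probability table (a minimal sketch; `mutual_information` is my own naming):

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in bits from a joint table p(x, y), via KL(P(X,Y) || P(X)P(Y))."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask]))

# Perfectly correlated bits: knowing Y removes all 1 bit of uncertainty about X
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(mutual_information(joint))  # 1.0

# Independent variables: joint = outer(p(x), p(y)), so I(X; Y) = 0
print(mutual_information(np.outer([0.5, 0.5], [0.5, 0.5])))
```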

## Connection to ML

| ML concept | Information theory |
| --- | --- |
| Cross-entropy loss | $H(\text{labels}, \text{model})$ |
| Variational lower bound (ELBO) | $-\text{KL}$ + reconstruction |
| Rate-distortion | compression tradeoff |
| Mutual information maximization | representation learning |