# Information Theory
## Entropy
Shannon entropy measures expected surprise (bits of information):
$$H(X) = -\sum_x p(x) \log_2 p(x) = \mathbb{E}[-\log p(X)]$$
- Maximum when $p$ is uniform: $H = \log_2 n$ for $n$ outcomes
- Zero when outcome is certain
Differential entropy for continuous $X$: $h(X) = -\int p(x) \log p(x) \, dx$
## KL Divergence
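The discrete definition above is easy to check numerically. A small sketch using NumPy (the helper name `entropy` is ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H(X) = -sum p log2 p, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes (limit p log p -> 0)
    return float(-np.sum(p * np.log2(p)))

# Uniform over 4 outcomes: maximum entropy, H = log2(4)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# Certain outcome: zero entropy
print(entropy([1.0, 0.0, 0.0, 0.0]))
```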
Measures how a distribution $P$ diverges from a reference $Q$: the expected extra bits needed to code samples from $P$ using a code optimized for $Q$:
$$\text{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_P\!\left[\log \frac{P(X)}{Q(X)}\right]$$
Properties:
- $\text{KL}(P \| Q) \geq 0$ (Gibbs’ inequality)
- $\text{KL}(P \| Q) = 0$ iff $P = Q$
- Not symmetric: $\text{KL}(P \| Q) \neq \text{KL}(Q \| P)$
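The asymmetry is easy to see on a concrete pair of distributions. A sketch (the helper name `kl` is ours; it assumes $Q(x) > 0$ wherever $P(x) > 0$):

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
# KL(P||Q) and KL(Q||P) disagree: the divergence is not a metric
print(kl(p, q), kl(q, p))
```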
## Cross-Entropy
$$H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + \text{KL}(P \| Q)$$
Since $H(P)$ is fixed by the data, minimizing cross-entropy loss over the model $Q$ ≡ minimizing $\text{KL}(P \| Q)$ between the true distribution $P$ and the model $Q$.
Classification cross-entropy loss (one-hot $P$):
$$L = -\sum_k y_k \log \hat{y}_k = -\log \hat{y}_{\text{true}}$$
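Both the decomposition $H(P, Q) = H(P) + \text{KL}(P \| Q)$ and the one-hot special case can be verified numerically. A sketch (the helper name `cross_entropy` is ours):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum P log2 Q in bits, skipping outcomes with p = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p = np.array([0.7, 0.2, 0.1])      # "true" distribution
q = np.array([0.5, 0.3, 0.2])      # model distribution
h_p = cross_entropy(p, p)          # H(P) = H(P, P)
kl_pq = cross_entropy(p, q) - h_p  # decomposition: H(P,Q) - H(P) = KL(P||Q) >= 0

# One-hot labels reduce to -log2 of the probability assigned to the true class
y = np.array([0.0, 1.0, 0.0])
print(cross_entropy(y, q))  # equals -log2(0.3)
```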
## Mutual Information
$$I(X; Y) = \text{KL}(P(X,Y) \| P(X)P(Y)) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
Measures how much knowing $Y$ reduces uncertainty about $X$.
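The KL form above computes directly from a joint pmf table. A sketch (the helper name `mutual_information` is ours) contrasting perfectly correlated and independent variables:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) = KL(P(X,Y) || P(X)P(Y)) in bits, from a 2-D joint pmf table."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)  # marginal P(X), column vector
    py = pxy.sum(axis=0, keepdims=True)  # marginal P(Y), row vector
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

# Perfectly correlated bits: knowing Y determines X, so I = H(X) = 1 bit
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(mutual_information(joint))  # 1.0

# Independent variables: the joint factorizes, so I = 0
indep = np.outer([0.5, 0.5], [0.3, 0.7])
print(mutual_information(indep))  # ~0
```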
## Connection to ML
| ML concept | Information theory |
|---|---|
| Cross-entropy loss | $H(\text{labels}, \text{model})$ |
| Variational lower bound (ELBO) | $-\text{KL}$ + reconstruction |
| Rate-distortion | compression tradeoff |
| Mutual information maximization | representation learning |