Probability & Statistics#

Key Distributions#

| Distribution | PMF/PDF | Mean | Variance | Use |
|---|---|---|---|---|
| Bernoulli($p$) | $p^x (1-p)^{1-x}$ | $p$ | $p(1-p)$ | binary outcome |
| Binomial($n,p$) | $\binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ | count of successes |
| Gaussian($\mu,\sigma^2$) | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/2\sigma^2}$ | $\mu$ | $\sigma^2$ | everything |
| Categorical($\pi$) | $\pi_k$ for class $k$ | — | — | classification |
| Dirichlet($\alpha$) | $\propto \prod_i x_i^{\alpha_i - 1}$ | $\alpha_i / \sum_j \alpha_j$ | — | prior over categoricals |
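As a quick numerical check, the mean formulas in the table can be verified by sampling. A sketch using NumPy; the parameter values and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from several of the distributions in the table.
bern = rng.binomial(1, 0.3, size=10_000)              # Bernoulli(p=0.3)
binom = rng.binomial(20, 0.3, size=10_000)            # Binomial(n=20, p=0.3)
diri = rng.dirichlet([2.0, 3.0, 5.0], size=10_000)    # Dirichlet(alpha)

# Empirical means should match the table's formulas.
print(bern.mean())        # ~ p = 0.3
print(binom.mean())       # ~ np = 6.0
print(diri.mean(axis=0))  # ~ alpha_i / sum(alpha) = [0.2, 0.3, 0.5]
```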

Bayes’ Theorem#

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$

  • $P(\theta)$ — prior
  • $P(x \mid \theta)$ — likelihood
  • $P(\theta \mid x)$ — posterior
  • $P(x)$ — marginal likelihood (normalizer)
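The four pieces can be made concrete with a grid approximation for a coin's bias. A minimal sketch, assuming a uniform prior and a Bernoulli likelihood; the counts are illustrative:

```python
import numpy as np

# Grid of candidate parameter values theta in (0, 1).
theta = np.linspace(0.001, 0.999, 999)

prior = np.ones_like(theta) / len(theta)        # P(theta): uniform prior
heads, tails = 7, 3
likelihood = theta**heads * (1 - theta)**tails  # P(x | theta)

unnorm = likelihood * prior
posterior = unnorm / unnorm.sum()               # normalize by marginal P(x)

print(theta[np.argmax(posterior)])              # posterior mode ~ 0.7
```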

Expectation and Variance#

$$\mathbb{E}[X] = \int x \, p(x) \, dx$$

$$\text{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

Law of total expectation: $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$

Law of total variance: $\text{Var}[X] = \mathbb{E}[\text{Var}[X \mid Y]] + \text{Var}[\mathbb{E}[X \mid Y]]$
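The law of total variance can be checked by simulation on a simple two-group hierarchy. A sketch with hypothetical group means and standard deviations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hierarchy: Y is a binary group label, X | Y is Gaussian per group.
n = 200_000
y = rng.binomial(1, 0.4, size=n)          # group indicator Y
mu = np.where(y == 1, 3.0, 0.0)           # E[X | Y]
sigma = np.where(y == 1, 0.5, 2.0)        # sd of X | Y
x = rng.normal(mu, sigma)

lhs = x.var()                             # Var[X]
rhs = (sigma**2).mean() + mu.var()        # E[Var[X|Y]] + Var[E[X|Y]]
print(lhs, rhs)                           # agree up to Monte Carlo error
```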

Maximum Likelihood Estimation#

Find $\theta$ that maximizes the likelihood of observed data:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_i p(x_i \mid \theta) = \arg\max_\theta \sum_i \log p(x_i \mid \theta)$$

The log-sum form is preferred for numerical stability: a product of many probabilities in $[0, 1]$ underflows floating point, while the equivalent sum of logs stays well-scaled.
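Both points can be demonstrated directly. A sketch: the underflow problem, then the closed-form Gaussian MLE (sample mean and biased, divide-by-$n$ sample variance); the specific numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Why the log: a product of many small likelihood terms underflows
# float64, while the equivalent sum of logs stays finite.
probs = np.full(2000, 0.01)
print(np.prod(probs))           # 0.0 (underflow)
print(np.log(probs).sum())      # ~ -9210.3, finite

# Gaussian MLE: maximizing the log-likelihood over (mu, sigma^2)
# yields the sample mean and the divide-by-n sample variance.
x = rng.normal(5.0, 2.0, size=100_000)
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()
print(mu_hat, var_hat)          # ~ 5.0, ~ 4.0
```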

Bias-Variance Tradeoff#

For estimator $\hat{f}(x)$ predicting $y$:

$$\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$

  • Bias: systematic error from wrong assumptions (underfitting)
  • Variance: sensitivity to training data (overfitting)
  • $\sigma^2$: irreducible noise
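The decomposition can be verified by Monte Carlo at a single test point. A sketch under assumed values (true signal $f(x_0) = 2$, noise sd $1$), using a deliberately biased shrinkage estimator so all three terms are nonzero:

```python
import numpy as np

rng = np.random.default_rng(3)

f_x0, noise_sd, n = 2.0, 1.0, 5   # true value, noise level, training size
trials = 100_000

# Each trial: a fresh training set of n noisy observations at x0,
# predicted with a shrinkage estimator (shrinking introduces bias).
y_train = f_x0 + rng.normal(0, noise_sd, size=(trials, n))
f_hat = 0.8 * y_train.mean(axis=1)

bias_sq = (f_hat.mean() - f_x0) ** 2      # ~ (0.2 * 2.0)^2 = 0.16
variance = f_hat.var()                    # ~ 0.64 / 5 = 0.128

# Expected squared error against fresh noisy test targets.
y_test = f_x0 + rng.normal(0, noise_sd, size=trials)
mse = ((y_test - f_hat) ** 2).mean()

print(mse, bias_sq + variance + noise_sd**2)  # the two sides should match
```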

Central Limit Theorem#

For i.i.d. samples $x_1, \ldots, x_n$ with mean $\mu$ and variance $\sigma^2$:

$$\sqrt{n}\,(\bar{x} - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \quad \text{as } n \to \infty$$

Sample mean is approximately Gaussian for large $n$, regardless of the original distribution.
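This is easy to see empirically even for a heavily skewed source distribution. A sketch using Exponential(1) samples (mean $\mu = 1$, variance $\sigma^2 = 1$); the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)

# Many independent sample means from a skewed distribution.
n, trials = 500, 10_000
samples = rng.exponential(scale=1.0, size=(trials, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0)   # sqrt(n)(xbar - mu)

print(z.mean(), z.std())      # ~ N(0, 1): mean ~ 0, sd ~ sigma = 1
# Fraction within one sd should be near the Gaussian 68.3%.
print((np.abs(z) < 1).mean())
```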