# Probability & Statistics

## Key Distributions
| Distribution | PMF/PDF | Mean | Variance | Use |
|---|---|---|---|---|
| Bernoulli($p$) | $p^x (1-p)^{1-x}$ | $p$ | $p(1-p)$ | binary outcome |
| Binomial($n,p$) | $\binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ | count successes |
| Gaussian($\mu,\sigma^2$) | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/2\sigma^2}$ | $\mu$ | $\sigma^2$ | everything |
| Categorical($\pi$) | $\pi_k$ for class $k$ | — | — | classification |
| Dirichlet($\alpha$) | $\propto \prod_i x_i^{\alpha_i - 1}$ | $\alpha_i / \sum_j \alpha_j$ | — | prior over categorical parameters |
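As a quick sanity check on the table, the Binomial rows can be verified by simulation: drawing Binomial($n,p$) samples as sums of Bernoulli trials and comparing the empirical mean and variance to $np$ and $np(1-p)$ (parameter values here are illustrative).

```python
import random

random.seed(0)

n, p = 20, 0.3  # illustrative Binomial parameters

# A Binomial(n, p) draw is the sum of n independent Bernoulli(p) trials.
samples = [sum(random.random() < p for _ in range(n)) for _ in range(50_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(mean)  # close to n*p       = 6.0
print(var)   # close to n*p*(1-p) = 4.2
```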
## Bayes’ Theorem

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$
- $P(\theta)$ — prior
- $P(x \mid \theta)$ — likelihood
- $P(\theta \mid x)$ — posterior
- $P(x)$ — marginal likelihood (normalizer)
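The four pieces can be made concrete with a grid approximation of the posterior over a coin's bias $\theta$ (toy data and grid resolution chosen for illustration): multiply prior by likelihood pointwise, then divide by their sum, which plays the role of $P(x)$.

```python
# Grid over possible biases theta in (0, 1); uniform prior P(theta).
thetas = [i / 100 for i in range(1, 100)]
prior = [1 / len(thetas)] * len(thetas)

data = [1, 1, 0, 1, 1, 0, 1, 1]  # toy coin flips: 6 heads, 2 tails

def likelihood(theta, data):
    """P(data | theta) for i.i.d. Bernoulli observations."""
    out = 1.0
    for x in data:
        out *= theta if x == 1 else (1 - theta)
    return out

unnorm = [likelihood(t, data) * p for t, p in zip(thetas, prior)]
evidence = sum(unnorm)                      # P(x), the normalizer
posterior = [u / evidence for u in unnorm]  # P(theta | x), sums to 1

# Under a uniform prior the posterior mode matches the MLE, 6/8 = 0.75.
map_theta = thetas[max(range(len(thetas)), key=lambda i: posterior[i])]
print(map_theta)  # 0.75
```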
## Expectation and Variance

$$\mathbb{E}[X] = \int x\, p(x)\, dx$$
$$\text{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$
Law of total expectation: $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$
Law of total variance: $\text{Var}[X] = \mathbb{E}[\text{Var}[X \mid Y]] + \text{Var}[\mathbb{E}[X \mid Y]]$
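The law of total variance can be checked on a two-component mixture, where both sides of the identity are available: simulate $X$ by first drawing $Y$, then compare the empirical $\text{Var}[X]$ against the analytic $\mathbb{E}[\text{Var}[X \mid Y]] + \text{Var}[\mathbb{E}[X \mid Y]]$ (component parameters below are made up).

```python
import random

random.seed(1)

# Mixture: Y picks a component, X | Y is Gaussian with that component's params.
mus = {0: -2.0, 1: 3.0}   # E[X | Y]
sds = {0: 1.0, 1: 0.5}    # sqrt(Var[X | Y])
p = 0.4                   # P(Y = 1)

xs = []
for _ in range(200_000):
    y = 1 if random.random() < p else 0
    xs.append(random.gauss(mus[y], sds[y]))

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

# Analytic decomposition of Var[X]:
e_var = (1 - p) * sds[0] ** 2 + p * sds[1] ** 2        # E[Var[X|Y]] = 0.7
e_mu = (1 - p) * mus[0] + p * mus[1]                   # E[E[X|Y]] = E[X]
var_mu = (1 - p) * (mus[0] - e_mu) ** 2 + p * (mus[1] - e_mu) ** 2  # 6.0

print(var, e_var + var_mu)  # the two should be close (about 6.7)
```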
## Maximum Likelihood Estimation
Find $\theta$ that maximizes the likelihood of observed data:
$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_i p(x_i \mid \theta) = \arg\max_\theta \sum_i \log p(x_i \mid \theta)$$
The log-sum form is preferred for numerical stability: a product of many small probabilities underflows, while a sum of logs does not, and both forms share the same maximizer since $\log$ is monotone.
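A minimal sketch with made-up Bernoulli data: maximize the log-likelihood over a grid of $\theta$ values and confirm the maximizer matches the closed-form MLE, the sample mean.

```python
from math import log

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # toy data: 7 successes out of 10

def log_lik(p, data):
    """Sum of log p(x_i | p) for i.i.d. Bernoulli observations."""
    return sum(log(p) if x == 1 else log(1 - p) for x in data)

# Scan a grid over (0, 1); the log-likelihood is concave in p here,
# so the grid maximizer should sit at the closed form sum(x)/n.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_lik(p, data))
print(p_hat)  # 0.7 = 7/10
```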
## Bias-Variance Tradeoff
For estimator $\hat{f}(x)$ predicting $y$:
$$\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$
- Bias: systematic error from wrong assumptions (underfitting)
- Variance: sensitivity to training data (overfitting)
- $\sigma^2$: irreducible noise
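The tradeoff shows up even in a toy setting (all numbers below are illustrative): estimating a Gaussian mean with the unbiased sample mean versus a shrunk estimator $0.5\,\bar{x}$, which accepts bias in exchange for lower variance.

```python
import random

random.seed(2)

# Repeatedly draw datasets and record both estimators of mu.
mu, sigma, n, trials = 1.0, 2.0, 10, 50_000

ests_mean, ests_shrunk = [], []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    ests_mean.append(xbar)          # unbiased, Var = sigma^2/n = 0.4
    ests_shrunk.append(0.5 * xbar)  # biased (bias = -0.5), Var = 0.1

def bias2_var(ests, target):
    """Empirical squared bias and variance of an estimator."""
    m = sum(ests) / len(ests)
    var = sum((e - m) ** 2 for e in ests) / len(ests)
    return (m - target) ** 2, var

# With these numbers, shrinkage wins on total error:
# bias^2 + var = 0.25 + 0.1 = 0.35  <  0 + 0.4 for the sample mean.
print(bias2_var(ests_mean, mu), bias2_var(ests_shrunk, mu))
```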
## Central Limit Theorem
For i.i.d. samples $x_1, \ldots, x_n$ with mean $\mu$ and variance $\sigma^2$:
$$\sqrt{n}\,(\bar{x} - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \quad \text{as } n \to \infty$$
Sample mean is approximately Gaussian for large $n$, regardless of the original distribution.
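A quick demonstration with a deliberately skewed source distribution (Exponential with rate 1, so $\mu = \sigma^2 = 1$; sample size and trial count are arbitrary): the standardized sample means should have mean $\approx 0$, variance $\approx 1$, and roughly 68% of mass within one standard deviation, as a standard normal would.

```python
import random
from math import sqrt

random.seed(3)

n, trials = 50, 20_000
z = []
for _ in range(trials):
    xs = [random.expovariate(1.0) for _ in range(n)]
    xbar = sum(xs) / n
    z.append(sqrt(n) * (xbar - 1.0))  # sqrt(n) * (xbar - mu), sigma = 1

mean_z = sum(z) / len(z)
var_z = sum((v - mean_z) ** 2 for v in z) / len(z)
frac = sum(abs(v) < 1 for v in z) / len(z)  # N(0,1) gives about 0.683

print(mean_z, var_z, frac)
```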