Probability & Statistics#

Key Distributions#

| Distribution | PMF/PDF | Mean | Variance | Use |
|---|---|---|---|---|
| Bernoulli($p$) | $p^x (1-p)^{1-x}$ | $p$ | $p(1-p)$ | binary outcome |
| Binomial($n,p$) | $\binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ | count of successes |
| Gaussian($\mu,\sigma^2$) | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/2\sigma^2}$ | $\mu$ | $\sigma^2$ | everything |
| Categorical($\pi$) | $\pi_k$ for class $k$ | — | — | classification |
| Dirichlet($\alpha$) | $\propto \prod_i x_i^{\alpha_i - 1}$ | $\alpha_i / \sum_j \alpha_j$ | — | prior over categoricals |
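As a quick numerical check, the mean formulas in the table can be verified by sampling. A sketch using NumPy; the parameter values and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from several of the distributions in the table.
bern = rng.binomial(1, 0.3, size=10_000)              # Bernoulli(p=0.3)
binom = rng.binomial(20, 0.3, size=10_000)            # Binomial(n=20, p=0.3)
diri = rng.dirichlet([2.0, 3.0, 5.0], size=10_000)    # Dirichlet(alpha)

# Empirical means should match the table's formulas.
print(bern.mean())        # ~ p = 0.3
print(binom.mean())       # ~ np = 6.0
print(diri.mean(axis=0))  # ~ alpha_i / sum(alpha) = [0.2, 0.3, 0.5]
```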

Bayes’ Theorem#

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$

  • $P(\theta)$ — prior
  • $P(x \mid \theta)$ — likelihood
  • $P(\theta \mid x)$ — posterior
  • $P(x)$ — marginal likelihood (normalizer)
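The four pieces can be made concrete with a grid approximation for a coin's bias. A minimal sketch, assuming a uniform prior and a Bernoulli likelihood; the counts are illustrative:

```python
import numpy as np

# Grid of candidate parameter values theta in (0, 1).
theta = np.linspace(0.001, 0.999, 999)

prior = np.ones_like(theta) / len(theta)        # P(theta): uniform prior
heads, tails = 7, 3
likelihood = theta**heads * (1 - theta)**tails  # P(x | theta)

unnorm = likelihood * prior
posterior = unnorm / unnorm.sum()               # normalize by marginal P(x)

print(theta[np.argmax(posterior)])              # posterior mode ~ 0.7
```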

Expectation and Variance#

$$\mathbb{E}[X] = \int x \, p(x) \, dx$$

$$\text{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

Law of total expectation: $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$

Law of total variance: $\text{Var}[X] = \mathbb{E}[\text{Var}[X \mid Y]] + \text{Var}[\mathbb{E}[X \mid Y]]$
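The law of total variance can be checked by simulation on a simple two-group hierarchy. A sketch with hypothetical group means and standard deviations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hierarchy: Y is a binary group label, X | Y is Gaussian per group.
n = 200_000
y = rng.binomial(1, 0.4, size=n)          # group indicator Y
mu = np.where(y == 1, 3.0, 0.0)           # E[X | Y]
sigma = np.where(y == 1, 0.5, 2.0)        # sd of X | Y
x = rng.normal(mu, sigma)

lhs = x.var()                             # Var[X]
rhs = (sigma**2).mean() + mu.var()        # E[Var[X|Y]] + Var[E[X|Y]]
print(lhs, rhs)                           # agree up to Monte Carlo error
```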

Maximum Likelihood Estimation#

Find $\theta$ that maximizes the likelihood of observed data:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_i p(x_i \mid \theta) = \arg\max_\theta \sum_i \log p(x_i \mid \theta)$$

The log-sum form is preferred for numerical stability: a product of many probabilities in $[0, 1]$ underflows floating point, while the equivalent sum of logs stays well-scaled.
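Both points can be demonstrated directly. A sketch: the underflow problem, then the closed-form Gaussian MLE (sample mean and biased, divide-by-$n$ sample variance); the specific numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Why the log: a product of many small likelihood terms underflows
# float64, while the equivalent sum of logs stays finite.
probs = np.full(2000, 0.01)
print(np.prod(probs))           # 0.0 (underflow)
print(np.log(probs).sum())      # ~ -9210.3, finite

# Gaussian MLE: maximizing the log-likelihood over (mu, sigma^2)
# yields the sample mean and the divide-by-n sample variance.
x = rng.normal(5.0, 2.0, size=100_000)
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()
print(mu_hat, var_hat)          # ~ 5.0, ~ 4.0
```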

Bias-Variance Tradeoff#

For estimator $\hat{f}(x)$ predicting $y$:

$$\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$

  • Bias: systematic error from wrong assumptions (underfitting)
  • Variance: sensitivity to training data (overfitting)
  • $\sigma^2$: irreducible noise
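The decomposition can be verified by Monte Carlo at a single test point. A sketch under assumed values (true signal $f(x_0) = 2$, noise sd $1$), using a deliberately biased shrinkage estimator so all three terms are nonzero:

```python
import numpy as np

rng = np.random.default_rng(3)

f_x0, noise_sd, n = 2.0, 1.0, 5   # true value, noise level, training size
trials = 100_000

# Each trial: a fresh training set of n noisy observations at x0,
# predicted with a shrinkage estimator (shrinking introduces bias).
y_train = f_x0 + rng.normal(0, noise_sd, size=(trials, n))
f_hat = 0.8 * y_train.mean(axis=1)

bias_sq = (f_hat.mean() - f_x0) ** 2      # ~ (0.2 * 2.0)^2 = 0.16
variance = f_hat.var()                    # ~ 0.64 / 5 = 0.128

# Expected squared error against fresh noisy test targets.
y_test = f_x0 + rng.normal(0, noise_sd, size=trials)
mse = ((y_test - f_hat) ** 2).mean()

print(mse, bias_sq + variance + noise_sd**2)  # the two sides should match
```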

Central Limit Theorem#

For i.i.d. samples $x_1, \ldots, x_n$ with mean $\mu$ and variance $\sigma^2$:

$$\sqrt{n}\,(\bar{x} - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \quad \text{as } n \to \infty$$

Sample mean is approximately Gaussian for large $n$, regardless of the original distribution.
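This is easy to see empirically even for a heavily skewed source distribution. A sketch using Exponential(1) samples (mean $\mu = 1$, variance $\sigma^2 = 1$); the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)

# Many independent sample means from a skewed distribution.
n, trials = 500, 10_000
samples = rng.exponential(scale=1.0, size=(trials, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0)   # sqrt(n)(xbar - mu)

print(z.mean(), z.std())      # ~ N(0, 1): mean ~ 0, sd ~ sigma = 1
# Fraction within one sd should be near the Gaussian 68.3%.
print((np.abs(z) < 1).mean())
```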