Loss Functions#
Mean Squared Error (MSE)#
$$L = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$
- Regression tasks
- Penalizes large errors quadratically
- Gradient: $\partial L / \partial \hat{y}_i = -2(y_i - \hat{y}_i)/n$
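A minimal pure-Python sketch of MSE and its per-prediction gradient (function names are illustrative):

```python
def mse(y, y_hat):
    # Mean of squared residuals.
    n = len(y)
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / n

def mse_grad(y, y_hat):
    # dL/dy_hat_i = -2 (y_i - y_hat_i) / n
    n = len(y)
    return [-2 * (yi - yh) / n for yi, yh in zip(y, y_hat)]
```

Note the quadratic penalty: doubling an error quadruples its contribution to the loss.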
Mean Absolute Error (MAE)#
$$L = \frac{1}{n} \sum_i \lvert y_i - \hat{y}_i \rvert$$
- More robust to outliers than MSE
- Gradient undefined at 0; use subgradient
Huber Loss#
$$L_\delta(y, \hat{y}) = \begin{cases} \tfrac{1}{2}(y-\hat{y})^2 & \text{if } \lvert y-\hat{y} \rvert \leq \delta \\ \delta(\lvert y-\hat{y} \rvert - \delta/2) & \text{otherwise} \end{cases}$$
Combines MSE (small errors) with MAE (large errors). $\delta$ is a hyperparameter.
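The piecewise definition translates directly into code; a minimal sketch for a single error:

```python
def huber(y, y_hat, delta=1.0):
    # Quadratic inside |error| <= delta, linear (slope delta) outside.
    e = abs(y - y_hat)
    if e <= delta:
        return 0.5 * e ** 2
    return delta * (e - 0.5 * delta)
```

The two branches meet smoothly at $\lvert e \rvert = \delta$ (both value and slope agree), which is what makes Huber differentiable everywhere, unlike MAE.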
Binary Cross-Entropy#
$$L = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$
- Binary classification with sigmoid output
- Maximum likelihood under Bernoulli model
- Numerically stable form:
F.binary_cross_entropy_with_logits
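The stable form works on logits rather than probabilities; one common identity (used, with variations, by such implementations) is $\max(z, 0) - zy + \log(1 + e^{-\lvert z \rvert})$. A sketch for a single logit:

```python
import math

def bce_with_logits(z, y):
    # Stable BCE from a logit z and label y in {0, 1}:
    # max(z, 0) - z*y + log(1 + exp(-|z|)) avoids overflow for large |z|.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
```

This avoids computing `log(sigmoid(z))` directly, which underflows to `-inf` for very negative `z`.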
Categorical Cross-Entropy#
$$L = -\sum_k y_k \log \hat{y}_k$$
For one-hot $y$: $L = -\log \hat{y}_\text{true}$
Numerically stable: combine softmax + cross-entropy as log-sum-exp:
$$L = -z_\text{true} + \log \sum_k \exp(z_k)$$
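The log-sum-exp form above can be sketched in plain Python (a hypothetical helper, not the PyTorch API), with the usual max-subtraction trick for stability:

```python
import math

def cross_entropy_from_logits(logits, target):
    # L = -z_true + log(sum_k exp(z_k)); subtracting the max logit
    # before exponentiating prevents overflow.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]
```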
F.cross_entropy(logits, targets)  # PyTorch: handles softmax internally
Focal Loss#
$$L = -\alpha_t (1-p_t)^\gamma \log(p_t)$$
Modifies cross-entropy to down-weight easy examples, focus on hard ones. Used in object detection (RetinaNet) with class imbalance.
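A minimal sketch for a single example, where `p_t` is the model's probability for the true class (defaults mirror common RetinaNet settings, $\alpha_t = 0.25$, $\gamma = 2$):

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    # (1 - p_t)^gamma -> 0 as p_t -> 1, so confident (easy) examples
    # contribute almost nothing; hard examples dominate the loss.
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)
```

With $\gamma = 0$ and $\alpha_t = 1$ this reduces to ordinary cross-entropy.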
KL Divergence Loss#
$$L = \text{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Used in VAEs and knowledge distillation.
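A sketch for discrete distributions given as probability lists (terms where $P(x) = 0$ contribute zero by convention):

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) = sum_x p(x) * log(p(x) / q(x)); skip p(x) == 0 terms.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note the asymmetry: `kl_divergence(p, q)` generally differs from `kl_divergence(q, p)`, so KL is not a metric.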
Contrastive / Triplet Loss#
Contrastive: $L = y \cdot d^2 + (1-y) \cdot \max(0, m-d)^2$
Triplet: $L = \max(0, d(a,p) - d(a,n) + m)$
Where $d$ = distance, $a$ = anchor, $p$ = positive, $n$ = negative, $m$ = margin.
Used in metric learning, face recognition, sentence embeddings.
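Given precomputed distances, the triplet loss is a one-liner; a sketch (argument names are illustrative):

```python
def triplet_loss(d_ap, d_an, margin=1.0):
    # d_ap: distance(anchor, positive); d_an: distance(anchor, negative).
    # Zero once the negative is at least `margin` farther than the positive.
    return max(0.0, d_ap - d_an + margin)
```

The hinge means already-satisfied triplets produce no gradient, which is why triplet mining (choosing hard or semi-hard triplets) matters in practice.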
RLHF Reward Modeling#
Preference loss over pairs ($y_w \succ y_l$):
$$L = -\mathbb{E}[\log \sigma(r(x,y_w) - r(x,y_l))]$$
Trains a reward model from human preference data (Bradley-Terry model).
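A sketch of the pairwise loss given scalar rewards for the chosen ($y_w$) and rejected ($y_l$) responses:

```python
import math

def preference_loss(r_w, r_l):
    # -log sigma(r_w - r_l): minimized by pushing the reward of the
    # preferred response above that of the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))
```

At `r_w == r_l` the loss is $\log 2$ (the model is indifferent), and it decreases monotonically as the reward margin grows.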