Loss Functions#

Mean Squared Error (MSE)#

$$L = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$

  • Regression tasks
  • Penalizes large errors quadratically
  • Gradient: $\partial L / \partial \hat{y}_i = -2(y_i - \hat{y}_i)/n$
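A minimal dependency-free sketch of MSE and its per-prediction gradient (`mse` and `mse_grad` are illustrative names, not library functions):

```python
import math

def mse(y, y_hat):
    """Mean squared error over paired lists of targets and predictions."""
    n = len(y)
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / n

def mse_grad(y, y_hat):
    """Gradient w.r.t. each prediction: -2 * (y_i - y_hat_i) / n."""
    n = len(y)
    return [-2.0 * (yi - yh) / n for yi, yh in zip(y, y_hat)]
```

Note how the quadratic penalty shows up in the gradient: it grows linearly with the residual, so large errors pull harder.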

Mean Absolute Error (MAE)#

$$L = \frac{1}{n} \sum_i \lvert y_i - \hat{y}_i \rvert$$

  • More robust to outliers than MSE
  • Gradient undefined at 0; use subgradient
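The subgradient convention can be sketched the same way, picking 0 at the kink (hypothetical helper names):

```python
def mae(y, y_hat):
    """Mean absolute error over paired lists."""
    n = len(y)
    return sum(abs(yi - yh) for yi, yh in zip(y, y_hat)) / n

def mae_subgrad(y, y_hat):
    """Subgradient w.r.t. each prediction: sign(y_hat_i - y_i)/n, 0 at a tie."""
    n = len(y)
    return [(0 if yh == yi else (1 if yh > yi else -1)) / n
            for yi, yh in zip(y, y_hat)]
```

Unlike MSE, the subgradient magnitude is constant ($1/n$) regardless of residual size, which is why MAE is less sensitive to outliers.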

Huber Loss#

$$L_\delta(y, \hat{y}) = \begin{cases} \tfrac{1}{2}(y-\hat{y})^2 & \text{if } \lvert y-\hat{y} \rvert \leq \delta \\ \delta(\lvert y-\hat{y} \rvert - \delta/2) & \text{otherwise} \end{cases}$$

Combines MSE (small errors) with MAE (large errors). $\delta$ is a hyperparameter.
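A direct transcription of the piecewise definition (the `-delta/2` term makes the two branches meet smoothly at $\lvert r \rvert = \delta$):

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss for a single target/prediction pair."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r ** 2          # quadratic near zero, like MSE
    return delta * (r - 0.5 * delta)  # linear in the tails, like MAE
```

At the boundary both branches give $\tfrac{1}{2}\delta^2$, so the loss and its derivative are continuous.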

Binary Cross-Entropy#

$$L = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$

  • Binary classification with sigmoid output
  • Maximum likelihood under Bernoulli model
  • Numerically stable form: `F.binary_cross_entropy_with_logits`
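The stable logits form avoids computing the sigmoid explicitly. A sketch of the standard identity $\max(z, 0) - zy + \log(1 + e^{-\lvert z \rvert})$, which is what PyTorch's `binary_cross_entropy_with_logits` computes internally (helper name is illustrative):

```python
import math

def bce_with_logits(z, y):
    """Numerically stable binary cross-entropy on a raw logit z, label y in {0, 1}."""
    # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
    # but safe for large |z| where sigmoid saturates to 0 or 1.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
```

For example, a logit of $-100$ with label $1$ gives a loss of about $100$; the naive form would underflow `sigmoid(-100)` to 0 and return `-inf`.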

Categorical Cross-Entropy#

$$L = -\sum_k y_k \log \hat{y}_k$$

For one-hot $y$: $L = -\log \hat{y}_\text{true}$

Numerically stable: combine softmax + cross-entropy as log-sum-exp:

$$L = -z_\text{true} + \log \sum_k \exp(z_k)$$

```python
F.cross_entropy(logits, targets)  # PyTorch: handles softmax internally
```
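The log-sum-exp trick subtracts the max logit before exponentiating so nothing overflows. A plain-Python sketch of the stable form (hypothetical `stable_cross_entropy` helper; PyTorch does this for you):

```python
import math

def stable_cross_entropy(logits, target):
    """Softmax cross-entropy via -z_true + logsumexp(z), max-shifted for stability."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]
```

With logits `[1000, 0]` and target class 0, the naive `exp(1000)` overflows, but the shifted form returns a loss of essentially 0.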

Focal Loss#

$$L = -\alpha_t (1-p_t)^\gamma \log(p_t)$$

Modifies cross-entropy to down-weight easy examples, focus on hard ones. Used in object detection (RetinaNet) with class imbalance.
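A sketch of the modulating factor, taking $p_t$ (the predicted probability of the true class) directly; the default $\gamma = 2$, $\alpha_t = 0.25$ follow the RetinaNet paper, and the function name is illustrative:

```python
import math

def focal_loss(p_t, gamma=2.0, alpha_t=0.25):
    """Focal loss given p_t, the model's probability for the true class."""
    # (1 - p_t)^gamma shrinks toward 0 as p_t -> 1, down-weighting easy examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting $\gamma = 0$ and $\alpha_t = 1$ recovers plain cross-entropy; a confidently correct example ($p_t = 0.9$) contributes far less loss than a hard one ($p_t = 0.1$).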

KL Divergence Loss#

$$L = \text{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

Used in VAEs and knowledge distillation.
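For discrete distributions the sum is direct; terms with $P(x) = 0$ contribute 0 by convention (a sketch assuming $Q(x) > 0$ wherever $P(x) > 0$):

```python
import math

def kl_div(p, q):
    """KL(P || Q) for discrete distributions given as probability lists."""
    # Skip terms with p_i == 0 (their limit contribution is 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note the asymmetry: $\text{KL}(P \,\|\, Q) \neq \text{KL}(Q \,\|\, P)$ in general, and it is 0 only when $P = Q$.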

Contrastive / Triplet Loss#

Contrastive: $L = y \cdot d^2 + (1-y) \cdot \max(0, m-d)^2$

Triplet: $L = \max(0,\, d(a,p) - d(a,n) + m)$

Where $d$ = distance, $y = 1$ for similar pairs ($0$ otherwise), $a$ = anchor, $p$ = positive, $n$ = negative, $m$ = margin.

Used in metric learning, face recognition, sentence embeddings.
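A sketch of the triplet hinge with Euclidean distance on plain lists (helper names are illustrative; libraries like PyTorch ship this as `TripletMarginLoss`):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def triplet_loss(a, p, n, margin=1.0):
    """Hinge on d(a,p) - d(a,n) + margin; zero once the negative is margin-separated."""
    return max(0.0, euclidean(a, p) - euclidean(a, n) + margin)
```

The loss is zero exactly when the negative is at least `margin` farther from the anchor than the positive, which is the geometric property metric learning trains for.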

RLHF Reward Modeling#

Preference loss over pairs ($y_w \succ y_l$):

$$L = -\mathbb{E}[\log \sigma(r(x,y_w) - r(x,y_l))]$$

Trains a reward model from human preference data (Bradley-Terry model).
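The pairwise term is $-\log \sigma(r_w - r_l)$, i.e. softplus of the negated reward gap. A stable sketch for one preference pair (function name is illustrative; real reward-model training averages this over a batch of pairs):

```python
import math

def preference_loss(r_w, r_l):
    """Bradley-Terry pairwise loss: -log sigmoid(r_w - r_l)."""
    d = r_w - r_l
    # -log(sigmoid(d)) == softplus(-d) == log(1 + exp(-d)),
    # computed as max(-d, 0) + log1p(exp(-|d|)) to avoid overflow.
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
```

Equal rewards give $\log 2$; the loss shrinks toward 0 as the chosen response $y_w$ is scored increasingly above the rejected $y_l$.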