Regularization#
Techniques to reduce overfitting by constraining model complexity.
L2 Regularization (Weight Decay)#
Add penalty to loss: $L_\text{reg} = L + \frac{\lambda}{2} \|\theta\|_2^2$
Gradient update: $\theta \leftarrow \theta - \alpha(\nabla L + \lambda\theta) = (1 - \alpha\lambda)\theta - \alpha\nabla L$
Shrinks weights toward zero. Equivalent to Gaussian prior on weights.
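The update above can be sketched in a few lines. This is a minimal illustration of the decoupled form $(1 - \alpha\lambda)\theta - \alpha\nabla L$; the function name and parameter names are illustrative, not from any library.

```python
def sgd_step_with_weight_decay(theta, grad, alpha=0.1, lam=0.01):
    """One SGD step with L2 penalty: theta <- (1 - alpha*lam)*theta - alpha*grad."""
    return (1 - alpha * lam) * theta - alpha * grad

# With zero gradient, the decay term alone shrinks the weight toward 0:
theta = 1.0
for _ in range(100):
    theta = sgd_step_with_weight_decay(theta, grad=0.0)
# theta has decayed by a factor of (1 - 0.001)**100, roughly 0.905
```

In practice this is the `weight_decay` argument on most deep-learning optimizers rather than something you implement by hand.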
L1 Regularization (Lasso)#
$$L_\text{reg} = L + \lambda \|\theta\|_1$$
Encourages sparsity — many weights become exactly 0. Equivalent to Laplace prior.
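The sparsity mechanism is easiest to see in the proximal (soft-thresholding) step for the L1 penalty, which clips small weights to exactly zero. A scalar sketch, with `t` playing the role of the threshold $\lambda$ times the step size:

```python
def soft_threshold(theta, t):
    """Proximal step for the L1 penalty: shrink |theta| by t,
    and snap to exactly 0 when |theta| <= t - this is where the
    exact zeros (sparsity) come from."""
    if theta > t:
        return theta - t
    if theta < -t:
        return theta + t
    return 0.0

# Small weights are zeroed exactly; large ones merely shrink:
# soft_threshold(0.05, 0.1) -> 0.0
# soft_threshold(1.0, 0.1)  -> 0.9
```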
Dropout#
During training, randomly zero out each neuron with probability $p$:
```python
# PyTorch
self.dropout = nn.Dropout(p=0.5)
h = self.dropout(h)  # only active during training
```

At test time, scale activations by $(1-p)$ (or equivalently, use inverted dropout during training, which is what `nn.Dropout` does).
Intuition: prevents co-adaptation; approximates ensemble of $2^n$ networks.
Typical rates: $p = 0.1$–$0.3$ for dense layers; less common in conv layers.
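Inverted dropout itself is simple enough to sketch from scratch. A stdlib-only illustration on a plain list of activations (`rng` is injectable just to make it testable):

```python
import random

def inverted_dropout(h, p, training=True, rng=random):
    """Zero each unit with probability p during training and scale
    survivors by 1/(1-p), so expected activations match at test time
    and inference needs no rescaling."""
    if not training or p == 0.0:
        return list(h)
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in h]
```

At inference (`training=False`) the function is the identity, which is exactly the point of the inverted formulation.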
Batch Normalization#
Normalize each mini-batch: $\hat{h} = (h - \mu_B) / \sqrt{\sigma^2_B + \varepsilon}$
Then learned scale/shift: $y = \gamma \hat{h} + \beta$
Benefits:
- Reduces internal covariate shift
- Acts as mild regularizer
- Allows higher learning rates
- Reduces sensitivity to initialization
At inference: use running statistics accumulated during training.
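The training-time forward pass for a single feature can be sketched directly from the two formulas above, using the biased (divide-by-$n$) batch variance:

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across the mini-batch, then apply the
    learned scale/shift: y = gamma * (x - mu_B)/sqrt(var_B + eps) + beta.
    `batch` is a list of that feature's activations over the batch."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n  # biased batch variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]
```

With the default $\gamma = 1, \beta = 0$, the output has (approximately) zero mean and unit variance over the batch; at inference you would plug in the running statistics instead of `mu`/`var`.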
Layer Normalization#
Normalize across features (not batch): used in Transformers.
$$\text{LN}(x) = \frac{x - \mu_x}{\sqrt{\sigma^2_x + \varepsilon}} \cdot \gamma + \beta$$
Works with batch size = 1; no difference between train/test.
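The only change from batch norm is the axis of the statistics: here they are computed over the features of a single sample, so no batch is needed. A minimal sketch:

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize across the feature dimension of one sample.
    No batch statistics, so train and test behave identically
    and batch size 1 is fine."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]
```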
Early Stopping#
Monitor validation loss; stop training when it starts increasing. Simple and effective.
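A common refinement is to allow a `patience` window rather than stopping at the first uptick. A minimal sketch (real training loops usually also restore the best checkpoint, which is omitted here):

```python
class EarlyStopping:
    """Stop when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```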
Data Augmentation#
Artificially expand dataset: flips, crops, color jitter, mixup, cutmix. Strong regularization, especially in vision.
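Of the listed techniques, mixup is the one whose mechanics are easiest to miss: it blends *both* the inputs and the (one-hot) labels. A stdlib-only sketch on plain lists, with `alpha` controlling the Beta-distributed mixing coefficient:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Mixup: convex-combine two examples and their one-hot labels
    with lam ~ Beta(alpha, alpha). Inputs are plain lists here
    purely for illustration."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Because the labels are mixed too, the model is trained against soft targets rather than hard class indices.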
Summary#
| Method | Mechanism | When to use |
|---|---|---|
| Weight decay | penalize large weights | almost always |
| Dropout | random zeroing | FC layers, Transformers |
| Batch norm | normalize activations | CNNs, deep networks |
| Layer norm | normalize per token | Transformers |
| Early stopping | monitor val loss | always |
| Data augmentation | expand training set | vision, NLP |