Regularization#

Techniques to reduce overfitting by constraining model complexity.

L2 Regularization (Weight Decay)#

Add a penalty to the loss: $L_\text{reg} = L + \frac{\lambda}{2} \|\theta\|_2^2$

Gradient update: $\theta \leftarrow \theta - \alpha(\nabla L + \lambda\theta) = (1 - \alpha\lambda)\theta - \alpha\nabla L$

Shrinks weights toward zero. Equivalent to MAP estimation with a zero-mean Gaussian prior on the weights.
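
As a minimal sketch of how this looks in practice (the model, data, and coefficients are illustrative), PyTorch optimizers fold the penalty into the update via their `weight_decay` argument, which adds $\lambda\theta$ to each parameter's gradient:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                      # toy model for illustration
x, y = torch.randn(8, 16), torch.randn(8, 1)  # dummy batch

# weight_decay adds lambda * theta to each gradient, yielding the
# (1 - alpha*lambda) shrinkage in the update above
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()  # theta <- (1 - alpha*lambda) * theta - alpha * grad(L)
```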

L1 Regularization (Lasso)#

$$L_\text{reg} = L + \lambda \|\theta\|_1$$

Encourages sparsity — many weights become exactly 0. Equivalent to a Laplace prior on the weights.
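
An L1 penalty is usually added to the loss by hand; a minimal sketch (the coefficient `lam` is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                      # toy model for illustration
x, y = torch.randn(8, 16), torch.randn(8, 1)  # dummy batch

lam = 1e-3                                    # illustrative L1 strength
data_loss = nn.functional.mse_loss(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + lam * l1_penalty           # L_reg = L + lambda * ||theta||_1
loss.backward()
```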

Dropout#

During training, randomly zero out each neuron with probability $p$:

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)       # PyTorch uses inverted dropout internally
h = dropout(torch.randn(8, 128))  # only active in train() mode; identity in eval()
```

At test time, scale activations by $(1-p)$; equivalently, use inverted dropout during training, scaling surviving activations by $1/(1-p)$ so no test-time rescaling is needed.
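
For intuition, a hand-rolled sketch of inverted dropout (this mirrors what `nn.Dropout` does, up to implementation details):

```python
import torch

def inverted_dropout(h, p=0.5, training=True):
    # Zero each element with probability p; scale survivors by 1/(1-p)
    # so activations need no rescaling at test time.
    if not training or p == 0.0:
        return h
    mask = (torch.rand_like(h) >= p).float()
    return h * mask / (1 - p)
```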

Intuition: prevents co-adaptation; approximates ensemble of $2^n$ networks.

Typical rates: $p = 0.1$–$0.5$ for dense layers (the classic default is $0.5$), lower values in Transformer blocks; less common in conv layers.

Batch Normalization#

Normalize each mini-batch: $\hat{h} = (h - \mu_B) / \sqrt{\sigma^2_B + \varepsilon}$

Then learned scale/shift: $y = \gamma \hat{h} + \beta$

Benefits:

  • Reduces internal covariate shift
  • Acts as mild regularizer
  • Allows higher learning rates
  • Reduces sensitivity to initialization

At inference: use running statistics accumulated during training.
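
A manual sketch of the training-time computation (in practice `nn.BatchNorm1d` would be used, since it also tracks the running statistics needed at inference):

```python
import torch

def batch_norm_train(h, gamma, beta, eps=1e-5):
    # h: (batch, features); statistics are taken over the batch dimension
    mu = h.mean(dim=0)
    var = h.var(dim=0, unbiased=False)
    h_hat = (h - mu) / torch.sqrt(var + eps)  # normalize
    return gamma * h_hat + beta               # learned scale and shift

h = torch.randn(32, 64)                       # dummy activations
gamma, beta = torch.ones(64), torch.zeros(64)
y = batch_norm_train(h, gamma, beta)
```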

Layer Normalization#

Normalize across the feature dimension of each example (not across the batch); standard in Transformers.

$$\text{LN}(x) = \frac{x - \mu_x}{\sqrt{\sigma^2_x + \varepsilon}} \cdot \gamma + \beta$$

Works with batch size = 1; no difference between train/test.
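
A sketch comparing `nn.LayerNorm` with the formula above, taking statistics over each token's feature dimension (shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 10, 512)  # (batch=1, tokens, features): works even at batch size 1
ln = nn.LayerNorm(512)       # gamma, beta learned per feature (init 1 and 0)

mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mu) / torch.sqrt(var + 1e-5)
print(torch.allclose(ln(x), manual, atol=1e-5))  # True with default gamma=1, beta=0
```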

Early Stopping#

Monitor validation loss; stop training when it starts increasing. Simple and effective.
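
A minimal patience-based sketch; `train_one_epoch`, `evaluate`, `model`, the loaders, and `max_epochs` are placeholders for whatever training setup is in use:

```python
best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)    # placeholder training step
    val_loss = evaluate(model, val_loader)  # placeholder validation step
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # val loss stopped improving
            break

model.load_state_dict(best_state)           # restore the best checkpoint
```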

Data Augmentation#

Artificially expand the dataset: flips, crops, color jitter, mixup, cutmix. A strong regularizer, especially in vision.
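
A typical torchvision pipeline as a sketch (transform choices and parameters are illustrative):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random crops
    transforms.RandomHorizontalFlip(),      # flips
    transforms.ColorJitter(0.4, 0.4, 0.4),  # color jitter
    transforms.ToTensor(),
])
# mixup / cutmix mix pairs of examples and labels, and are applied to batches
# inside the training loop rather than in the per-image transform pipeline
```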

Summary#

| Method | Mechanism | When to use |
| --- | --- | --- |
| Weight decay | penalize large weights | almost always |
| Dropout | random zeroing of activations | FC layers, Transformers |
| Batch norm | normalize activations over the batch | CNNs, deep networks |
| Layer norm | normalize per token/example | Transformers |
| Early stopping | monitor validation loss | always |
| Data augmentation | expand the training set | vision, NLP |