Regularization#

Techniques to reduce overfitting by constraining model complexity.

L2 Regularization (Weight Decay)#

Add a penalty to the loss: $L_\text{reg} = L + \frac{\lambda}{2} \|\theta\|_2^2$

Gradient update: $\theta \leftarrow \theta - \alpha(\nabla L + \lambda\theta) = (1 - \alpha\lambda)\theta - \alpha\nabla L$

Shrinks weights toward zero. Equivalent to MAP estimation with a zero-mean Gaussian prior on the weights.
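
As a minimal sketch of how this looks in practice (the model, data, and coefficients are illustrative), PyTorch optimizers fold the penalty into the update via their `weight_decay` argument, which adds $\lambda\theta$ to each parameter's gradient:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                      # toy model for illustration
x, y = torch.randn(8, 16), torch.randn(8, 1)  # dummy batch

# weight_decay adds lambda * theta to each gradient, yielding the
# (1 - alpha*lambda) shrinkage in the update above
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()  # theta <- (1 - alpha*lambda) * theta - alpha * grad(L)
```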

L1 Regularization (Lasso)#

$$L_\text{reg} = L + \lambda \|\theta\|_1$$

Encourages sparsity — many weights become exactly 0. Equivalent to a Laplace prior on the weights.
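
An L1 penalty is usually added to the loss by hand; a minimal sketch (the coefficient `lam` is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                      # toy model for illustration
x, y = torch.randn(8, 16), torch.randn(8, 1)  # dummy batch

lam = 1e-3                                    # illustrative L1 strength
data_loss = nn.functional.mse_loss(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + lam * l1_penalty           # L_reg = L + lambda * ||theta||_1
loss.backward()
```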

Dropout#

During training, randomly zero out each neuron with probability $p$:

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)       # PyTorch uses inverted dropout internally
h = dropout(torch.randn(8, 128))  # only active in train() mode; identity in eval()
```

At test time, scale activations by $(1-p)$; equivalently, use inverted dropout during training, scaling surviving activations by $1/(1-p)$ so no test-time rescaling is needed.
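
For intuition, a hand-rolled sketch of inverted dropout (this mirrors what `nn.Dropout` does, up to implementation details):

```python
import torch

def inverted_dropout(h, p=0.5, training=True):
    # Zero each element with probability p; scale survivors by 1/(1-p)
    # so activations need no rescaling at test time.
    if not training or p == 0.0:
        return h
    mask = (torch.rand_like(h) >= p).float()
    return h * mask / (1 - p)
```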

Intuition: prevents co-adaptation; approximates ensemble of $2^n$ networks.

Typical rates: $p = 0.1$–$0.5$ for dense layers (the classic default is $0.5$), lower values in Transformer blocks; less common in conv layers.

Batch Normalization#

Normalize each mini-batch: $\hat{h} = (h - \mu_B) / \sqrt{\sigma^2_B + \varepsilon}$

Then learned scale/shift: $y = \gamma \hat{h} + \beta$

Benefits:

  • Reduces internal covariate shift
  • Acts as mild regularizer
  • Allows higher learning rates
  • Reduces sensitivity to initialization

At inference: use running statistics accumulated during training.
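
A manual sketch of the training-time computation (in practice `nn.BatchNorm1d` would be used, since it also tracks the running statistics needed at inference):

```python
import torch

def batch_norm_train(h, gamma, beta, eps=1e-5):
    # h: (batch, features); statistics are taken over the batch dimension
    mu = h.mean(dim=0)
    var = h.var(dim=0, unbiased=False)
    h_hat = (h - mu) / torch.sqrt(var + eps)  # normalize
    return gamma * h_hat + beta               # learned scale and shift

h = torch.randn(32, 64)                       # dummy activations
gamma, beta = torch.ones(64), torch.zeros(64)
y = batch_norm_train(h, gamma, beta)
```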

Layer Normalization#

Normalize across the feature dimension of each example (not across the batch); standard in Transformers.

$$\text{LN}(x) = \frac{x - \mu_x}{\sqrt{\sigma^2_x + \varepsilon}} \cdot \gamma + \beta$$

Works with batch size = 1; no difference between train/test.
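
A sketch comparing `nn.LayerNorm` with the formula above, taking statistics over each token's feature dimension (shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 10, 512)  # (batch=1, tokens, features): works even at batch size 1
ln = nn.LayerNorm(512)       # gamma, beta learned per feature (init 1 and 0)

mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mu) / torch.sqrt(var + 1e-5)
print(torch.allclose(ln(x), manual, atol=1e-5))  # True with default gamma=1, beta=0
```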

Early Stopping#

Monitor validation loss; stop training when it starts increasing. Simple and effective.
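
A minimal patience-based sketch; `train_one_epoch`, `evaluate`, `model`, the loaders, and `max_epochs` are placeholders for whatever training setup is in use:

```python
best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)    # placeholder training step
    val_loss = evaluate(model, val_loader)  # placeholder validation step
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # val loss stopped improving
            break

model.load_state_dict(best_state)           # restore the best checkpoint
```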

Data Augmentation#

Artificially expand the dataset: flips, crops, color jitter, mixup, cutmix. A strong regularizer, especially in vision.
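
A typical torchvision pipeline as a sketch (transform choices and parameters are illustrative):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random crops
    transforms.RandomHorizontalFlip(),      # flips
    transforms.ColorJitter(0.4, 0.4, 0.4),  # color jitter
    transforms.ToTensor(),
])
# mixup / cutmix mix pairs of examples and labels, and are applied to batches
# inside the training loop rather than in the per-image transform pipeline
```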

Summary#

| Method | Mechanism | When to use |
| --- | --- | --- |
| Weight decay | penalize large weights | almost always |
| Dropout | random zeroing of activations | FC layers, Transformers |
| Batch norm | normalize activations over the batch | CNNs, deep networks |
| Layer norm | normalize per token/example | Transformers |
| Early stopping | monitor validation loss | always |
| Data augmentation | expand the training set | vision, NLP |