Learning Rate Schedules#
Why Schedule the Learning Rate#
- Start with moderate LR to make initial progress
- Decay to fine-tune into good minima
- Warmup avoids instability at initialization (especially large models)
Step Decay#
Multiply by $\gamma$ every $k$ epochs:
$$\alpha(t) = \alpha_0 \cdot \gamma^{\lfloor t/k \rfloor}$$
Simple, commonly used in CV. Typical: $\gamma = 0.1$ every 30 epochs for ImageNet.
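A minimal sketch of the formula in plain Python (the function name is mine; the defaults mirror the ImageNet numbers above):

```python
def step_decay(t, alpha0=0.1, gamma=0.1, k=30):
    """Multiply the initial LR alpha0 by gamma once every k epochs."""
    return alpha0 * gamma ** (t // k)

step_decay(0)   # 0.1 (no decay yet)
step_decay(30)  # ~0.01 (one decay applied)
```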
Cosine Annealing#
Smooth decay following cosine curve:
$$\alpha(t) = \alpha_\min + \tfrac{1}{2}(\alpha_\max - \alpha_\min)\left(1 + \cos\frac{\pi t}{T}\right)$$
- $t$: current step, $T$: total steps
- Goes from $\alpha_\max$ to $\alpha_\min$ smoothly
- Cosine with restarts (SGDR): reset LR periodically to escape local minima
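A pure-Python sketch of the formula (names are mine), useful for checking the endpoints; the `CosineAnnealingLR` call shown next is the library equivalent:

```python
import math

def cosine_lr(t, T, alpha_max, alpha_min=0.0):
    """Cosine annealing: alpha_max at t=0, alpha_min at t=T."""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))

cosine_lr(0, 1000, 1e-3)     # alpha_max at the start
cosine_lr(1000, 1000, 1e-3)  # alpha_min at the end
```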
```python
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=T, eta_min=1e-6)
```
Linear Warmup#
Ramp up from 0 (or a small value) to the target LR over the first $N$ steps:
$$\alpha(t) = \alpha_\text{target} \cdot \frac{t}{N}, \quad t \leq N$$
Prevents large gradient updates from random initialization. Essential for Transformers.
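A sketch of the warmup ramp (function name is mine; past step $N$ it simply holds the target LR):

```python
def warmup_lr(t, N, alpha_target):
    """Linear warmup: 0 at t=0, alpha_target at t=N (held afterwards)."""
    return alpha_target * min(t, N) / N
```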
Warmup + Cosine Decay#
Standard recipe for LLM training:
- Linear warmup: $0 \to \alpha_\max$ over first 1–2% of steps
- Cosine decay: $\alpha_\max \to \alpha_\min$ over remaining steps
```python
import math

def lr_lambda(step):
    # Linear warmup to 1.0 over warmup_steps, then cosine decay to 0
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))
```
1-Cycle Policy (Super-Convergence)#
Increase the LR from a low value up to a very high peak, then decrease it back down. In some settings it trains networks up to 10× faster than traditional schedules (popularized by fastai).
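PyTorch implements this as `torch.optim.lr_scheduler.OneCycleLR`. A dependency-free sketch of the shape (all parameter names and defaults here are illustrative, not from the text):

```python
import math

def one_cycle_lr(t, total_steps, alpha_max, pct_up=0.3, start_ratio=0.04):
    """1-cycle sketch: cosine ramp from a low LR up to alpha_max,
    then cosine anneal back down to the low LR."""
    up = int(total_steps * pct_up)
    low = alpha_max * start_ratio
    if t < up:  # ramp-up phase
        p = t / up
        return low + (alpha_max - low) * 0.5 * (1 - math.cos(math.pi * p))
    p = (t - up) / (total_steps - up)  # annealing phase
    return low + (alpha_max - low) * 0.5 * (1 + math.cos(math.pi * p))
```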
Inverse Square Root (Transformer original)#
$$\alpha(t) = d_\text{model}^{-0.5} \cdot \min\left(t^{-0.5},\ t \cdot \text{warmup\_steps}^{-1.5}\right)$$
Used in “Attention Is All You Need”. Rarely used now; warmup+cosine more common.
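A sketch of this schedule with the base-model defaults from the paper ($d_\text{model} = 512$, 4000 warmup steps); the peak LR occurs at $t = \text{warmup\_steps}$, where the two terms in the $\min$ coincide:

```python
def inv_sqrt_lr(t, d_model=512, warmup_steps=4000):
    """Inverse-square-root schedule: linear warmup for warmup_steps,
    then decay proportional to 1/sqrt(t)."""
    t = max(t, 1)  # formula is undefined at t=0
    return d_model ** -0.5 * min(t ** -0.5, t * warmup_steps ** -1.5)
```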
Practical Defaults#
| Setting | Schedule |
|---|---|
| Vision (ResNet/ViT) | Cosine annealing + 5-epoch warmup |
| LLM pretraining | Warmup 2K steps + cosine decay |
| Fine-tuning | Constant or linear decay, smaller $\alpha$ |
| RL | Constant LR or linear decay |