Learning Rate Schedules#

Why Schedule the Learning Rate#

  • Start with moderate LR to make initial progress
  • Decay to fine-tune into good minima
  • Warmup avoids instability at initialization (especially large models)

Step Decay#

Multiply by $\gamma$ every $k$ epochs:

$$\alpha(t) = \alpha_0 \cdot \gamma^{\lfloor t/k \rfloor}$$

Simple, commonly used in CV. Typical: $\gamma = 0.1$ every 30 epochs for ImageNet.

Cosine Annealing#

Smooth decay following cosine curve:

$$\alpha(t) = \alpha_\min + \tfrac{1}{2}(\alpha_\max - \alpha_\min)\left(1 + \cos\frac{\pi t}{T}\right)$$

  • $t$: current step, $T$: total steps
  • Goes from $\alpha_\max$ to $\alpha_\min$ smoothly
  • Cosine with restarts (SGDR): reset LR periodically to escape local minima

```python
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=T, eta_min=1e-6)
```
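As a sanity check on the formula, a tiny pure-Python version (parameter names `lr_max` and `lr_min` are illustrative, not an API):

```python
import math

def cosine_lr(t, T, lr_max=1e-3, lr_min=1e-6):
    # alpha(t) = alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + cos(pi * t / T))
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))
```

At $t = 0$ this returns $\alpha_\max$; at $t = T$ it returns $\alpha_\min$.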

Linear Warmup#

Ramp up from 0 (or small value) to target LR over first $N$ steps:

$$\alpha(t) = \alpha_\text{target} \cdot \frac{t}{N}, \quad t \leq N$$

Prevents large gradient updates from random initialization. Essential for Transformers.

Warmup + Cosine Decay#

Standard recipe for LLM training:

  1. Linear warmup: $0 \to \alpha_\max$ over first 1–2% of steps
  2. Cosine decay: $\alpha_\max \to \alpha_\min$ over remaining steps

```python
import math

def lr_lambda(step):
    # Multiplicative factor applied to the base LR, e.g. via
    # torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda).
    # warmup_steps and total_steps are assumed to be defined in scope.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))
```

1-Cycle Policy (Super-Convergence)#

Increase LR from low to high, then decrease back — but use very high peak LR. Can train 10× faster than traditional schedules in some settings (fastai).
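A minimal sketch of the shape of the schedule, assuming a cosine ramp in both phases (parameter names like `pct_up` and `lr_start_div` are illustrative, not the fastai or PyTorch API):

```python
import math

def one_cycle_lr(t, T, lr_max=1e-2, pct_up=0.3, lr_start_div=25.0):
    # Phase 1: cosine ramp from lr_max / lr_start_div up to lr_max.
    # Phase 2: cosine decay from lr_max back toward 0.
    up = int(T * pct_up)
    lr_start = lr_max / lr_start_div
    if t < up:
        p = t / up
        return lr_start + (lr_max - lr_start) * 0.5 * (1 - math.cos(math.pi * p))
    p = (t - up) / (T - up)
    return lr_max * 0.5 * (1 + math.cos(math.pi * p))
```

PyTorch ships a full implementation as `torch.optim.lr_scheduler.OneCycleLR`.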

Inverse Square Root (Transformer original)#

$$\alpha(t) = d_\text{model}^{-0.5} \cdot \min\left(t^{-0.5},\; t \cdot \text{warmup\_steps}^{-1.5}\right)$$

Used in “Attention Is All You Need”. Rarely used now; warmup+cosine more common.

Practical Defaults#

| Setting | Schedule |
|---|---|
| Vision (ResNet/ViT) | Cosine annealing + 5-epoch warmup |
| LLM pretraining | Warmup 2K steps + cosine decay |
| Fine-tuning | Constant or linear decay, smaller $\alpha$ |
| RL | Constant LR or linear decay |