Optimization#
Gradient Descent#
Iteratively move in the direction of steepest descent:
$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$$
- $\alpha$ (learning rate): step size — too large diverges, too small is slow
- Convergence: guaranteed for convex $L$ with appropriate $\alpha$
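As a minimal sketch (not from the text), the update rule on a simple convex quadratic, $L(\theta) = (\theta - 3)^2$ with gradient $2(\theta - 3)$; the function and step count are illustrative choices:

```python
# Plain gradient descent on L(theta) = (theta - 3)^2 (minimum at theta = 3).
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Repeatedly step against the gradient and return the final parameter."""
    theta = theta0
    for _ in range(steps):
        theta -= alpha * grad(theta)  # theta <- theta - alpha * grad L(theta)
    return theta

# The analytic gradient of (theta - 3)^2 is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)  # approaches 3
```

With this $\alpha$, the error shrinks by a constant factor each step, illustrating the convergence guarantee for convex losses.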
Convexity#
A function $f$ is convex if the line segment between any two points on its graph lies on or above the graph:
$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y) \quad \text{for } \lambda \in [0,1]$$
For convex $f$: any local minimum is a global minimum. Most neural network losses are non-convex.
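A quick numeric illustration (my addition, with $f(x) = x^2$ as an example convex function): the inequality above holds for every $\lambda \in [0,1]$.

```python
# Check the convexity inequality for f(x) = x**2 at several lambda values:
# f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
f = lambda x: x ** 2
x, y = -1.0, 4.0
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    assert lhs <= rhs + 1e-12  # small tolerance for floating point
```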
Stochastic Gradient Descent (SGD)#
Use a mini-batch of size $B$ instead of the full dataset:
$$\theta \leftarrow \theta - \frac{\alpha}{B} \sum_{i \in \text{batch}} \nabla_\theta L(x_i, \theta)$$
- Noisy gradient acts as implicit regularization
- Much cheaper per step than full-batch GD
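A minimal sketch of the mini-batch update, assuming a toy loss $L(\theta) = \frac{1}{N}\sum_i (\theta - x_i)^2$ whose minimizer is the data mean; batch size, learning rate, and epoch count are illustrative:

```python
import random

def sgd(data, alpha=0.05, B=4, epochs=200, seed=0):
    """Mini-batch SGD on L(theta) = mean_i (theta - x_i)^2."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(epochs):
        batch = rng.sample(data, B)               # random mini-batch
        grad = sum(2 * (theta - x) for x in batch) / B  # averaged gradient
        theta -= alpha * grad
    return theta

data = [float(i) for i in range(10)]
theta_hat = sgd(data)  # noisy, but approaches the data mean (4.5)
```

The per-step cost scales with $B$, not the dataset size, and the batch-to-batch gradient noise is visible as small fluctuations around the minimizer.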
Momentum#
Accumulate velocity to dampen oscillations:
$$v \leftarrow \beta v - \alpha \nabla L(\theta), \qquad \theta \leftarrow \theta + v$$
$\beta \approx 0.9$ is typical. Nesterov momentum evaluates gradient at the “lookahead” position.
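The two-line update above can be sketched as follows (a toy example on $L(\theta) = \theta^2$, not from the text; $\alpha$ and the step count are illustrative):

```python
def momentum_descent(grad, theta0, alpha=0.01, beta=0.9, steps=500):
    """Gradient descent with heavy-ball momentum."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad(theta)  # accumulate velocity
        theta = theta + v                   # move along the velocity
    return theta

# Minimize theta^2 (gradient 2*theta) starting from theta = 5.
theta_final = momentum_descent(lambda t: 2 * t, theta0=5.0)  # approaches 0
```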
Adam#
Adaptive Moment Estimation — combines momentum with per-parameter learning rates:
$$m \leftarrow \beta_1 m + (1-\beta_1) g \qquad \text{(1st moment: mean)}$$
$$v \leftarrow \beta_2 v + (1-\beta_2) g^2 \qquad \text{(2nd moment: variance)}$$
$$\hat{m} = \frac{m}{1-\beta_1^t}, \quad \hat{v} = \frac{v}{1-\beta_2^t} \qquad \text{(bias correction)}$$
$$\theta \leftarrow \theta - \frac{\alpha \hat{m}}{\sqrt{\hat{v}} + \varepsilon}$$
Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$, $\alpha = 10^{-3}$.
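A minimal single-parameter sketch of these four updates with the stated defaults; the toy loss $(\theta - 1)^2$ and step count are illustrative choices:

```python
import math

def adam(grad, theta0, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Adam for a single scalar parameter; t is 1-indexed for bias correction."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # 1st moment (mean)
        v = beta2 * v + (1 - beta2) * g * g    # 2nd moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimize (theta - 1)^2; the gradient is 2 * (theta - 1).
theta_final = adam(lambda t: 2 * (t - 1.0), theta0=0.0)
```

Note that dividing by $\sqrt{\hat{v}}$ makes the effective step size roughly $\alpha$ regardless of the gradient's scale.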
Saddle Points vs. Local Minima#
In high-dimensional spaces, saddle points (where the gradient is zero but the point is neither a minimum nor a maximum) are far more common than local minima. Gradient noise helps optimizers escape them. In practice, SGD-trained networks tend to reach flat minima, which are empirically associated with good generalization.
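A small illustration (my addition): $f(x, y) = x^2 - y^2$ has zero gradient at the origin, but the origin is a saddle. Plain GD started exactly there never moves, while a tiny amount of gradient noise escapes along the descending $-y^2$ direction; the noise scale is an illustrative choice.

```python
import random

def gd(x, y, alpha=0.1, steps=100, noise=0.0, seed=0):
    """GD on f(x, y) = x**2 - y**2, with optional Gaussian gradient noise."""
    rng = random.Random(seed)
    for _ in range(steps):
        gx, gy = 2 * x, -2 * y  # gradient of x^2 - y^2
        x -= alpha * (gx + noise * rng.gauss(0, 1))
        y -= alpha * (gy + noise * rng.gauss(0, 1))
    return x, y

stuck = gd(0.0, 0.0)                 # stays exactly at the saddle (0, 0)
escaped = gd(0.0, 0.0, noise=1e-3)   # |y| grows: descent along -y^2
```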
Learning Rate Schedules#
| Schedule | Formula | Use |
|---|---|---|
| Step decay | $\alpha \cdot \gamma^{\lfloor t/k \rfloor}$ | simple, piecewise |
| Cosine annealing | $\alpha_\min + \tfrac{1}{2}(\alpha_\max - \alpha_\min)(1 + \cos(\pi t / T))$ | smooth, widely used |
| Warmup | linearly increase then decay | Transformer training |
| 1-cycle | increase then decrease | fast training |
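The two closed-form schedules from the table can be sketched directly; a minimal version, using the same symbols ($\gamma$, $k$, $T$, $\alpha_\min$, $\alpha_\max$) with illustrative default values:

```python
import math

def step_decay(t, alpha=0.1, gamma=0.5, k=10):
    """Multiply the base rate by gamma every k steps: alpha * gamma^floor(t/k)."""
    return alpha * gamma ** (t // k)

def cosine_annealing(t, alpha_max=0.1, alpha_min=0.0, T=100):
    """Smoothly anneal from alpha_max at t=0 to alpha_min at t=T."""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))
```

For example, `step_decay(0)` returns the base rate and halves it every 10 steps, while `cosine_annealing` starts at `alpha_max` and reaches `alpha_min` exactly at `t = T`.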