Optimization#

Gradient Descent#

Iteratively move in the direction of steepest descent:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$$

  • $\alpha$ (learning rate): step size — too large diverges, too small is slow
  • Convergence: guaranteed for convex $L$ with appropriate $\alpha$
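The update rule can be sketched on a simple convex quadratic, $f(\theta) = (\theta - 3)^2$ with gradient $2(\theta - 3)$ (the function, starting point, and hyperparameters here are illustrative, not from the notes):

```python
# Gradient descent on f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
# The loss, starting point, and hyperparameters are illustrative.
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)  # theta <- theta - alpha * grad L(theta)
    return theta

theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

Since the loss is convex and $\alpha$ is small enough, the iterates contract toward the minimizer $\theta = 3$ geometrically.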

Convexity#

A function $f$ is convex if the line segment between any two points on its graph lies on or above the curve:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y) \quad \text{for } \lambda \in [0,1]$$

For convex $f$: any local minimum is a global minimum. Most neural network losses are non-convex.
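The defining inequality can be checked numerically for a known convex function, $f(x) = x^2$ (the test points and $\lambda$ grid here are illustrative):

```python
# Numeric check of f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
# for the convex function f(x) = x^2; test points are illustrative.
f = lambda x: x * x
x, y = -1.0, 2.0
ok = all(
    f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-12
    for lam in [0.0, 0.25, 0.5, 0.75, 1.0]
)
```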

Stochastic Gradient Descent (SGD)#

Use a mini-batch of size $B$ instead of full dataset:

$$\theta \leftarrow \theta - \frac{\alpha}{B} \sum_{i \in \text{batch}} \nabla_\theta L(x_i, \theta)$$

  • Noisy gradient acts as implicit regularization
  • Much cheaper per step than full-batch GD
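A minimal sketch of the mini-batch update on a least-squares loss $L(x_i, \theta) = \tfrac{1}{2}(x_i^\top \theta - y_i)^2$; the synthetic data, seed, and hyperparameters are illustrative:

```python
import numpy as np

# Mini-batch SGD on noiseless linear regression; data and settings are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
alpha, B = 0.1, 32
for _ in range(500):
    idx = rng.choice(len(X), size=B, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / B               # averaged mini-batch gradient
    theta -= alpha * grad
```

Because the targets are noiseless, the stochastic iterates still contract all the way to `true_theta`.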

Momentum#

Accumulate velocity to dampen oscillations:

$$v \leftarrow \beta v - \alpha \nabla L(\theta), \qquad \theta \leftarrow \theta + v$$

$\beta \approx 0.9$ is typical. Nesterov momentum evaluates gradient at the “lookahead” position.
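The two-line update above can be sketched on the same illustrative quadratic, $f(\theta) = (\theta - 3)^2$ (function and hyperparameters are assumptions, not from the notes):

```python
# Heavy-ball momentum on f(theta) = (theta - 3)^2; settings are illustrative.
def momentum_descent(grad, theta0, alpha=0.01, beta=0.9, steps=500):
    theta, v = theta0, 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad(theta)  # accumulate velocity
        theta = theta + v
    return theta

theta_star = momentum_descent(lambda t: 2 * (t - 3), theta0=0.0)
```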

Adam#

Adaptive Moment Estimation — combines momentum with per-parameter learning rates:

$$m \leftarrow \beta_1 m + (1-\beta_1) g \qquad \text{(1st moment: mean)}$$

$$v \leftarrow \beta_2 v + (1-\beta_2) g^2 \qquad \text{(2nd moment: variance)}$$

$$\hat{m} = \frac{m}{1-\beta_1^t}, \quad \hat{v} = \frac{v}{1-\beta_2^t} \qquad \text{(bias correction)}$$

$$\theta \leftarrow \theta - \frac{\alpha \hat{m}}{\sqrt{\hat{v}} + \varepsilon}$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$, $\alpha = 10^{-3}$.
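The four update equations translate directly into code. This sketch uses the stated defaults on an illustrative separable quadratic; the target, step count, and a loose convergence check are assumptions:

```python
import numpy as np

# Adam following the updates above; loss and step count are illustrative.
def adam(grad, theta, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=20000):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # 1st moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g**2     # 2nd moment (mean of squared gradients)
        m_hat = m / (1 - beta1**t)             # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

target = np.array([1.0, -2.0])
theta_star = adam(lambda th: 2 * (th - target), np.zeros(2))
```

Note the per-parameter division by $\sqrt{\hat{v}}$: early steps move each coordinate by roughly $\alpha$ regardless of the raw gradient scale.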

Saddle Points vs. Local Minima#

In high-dimensional spaces, saddle points (where the gradient is zero but the point is neither a minimum nor a maximum) are far more common than local minima. Gradient noise from mini-batching helps iterates escape saddle points. In practice, SGD tends to find flat minima, which are empirically associated with good generalization.
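A tiny illustration of the escape effect (the function, seed, and noise scale are assumptions): $f(x, y) = x^2 - y^2$ has a saddle at the origin, so plain GD initialized exactly there never moves, while noisy gradients push the iterate out along the $-y^2$ direction:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at the origin; gradient is (2x, -2y).
# Setup and noise scale are illustrative.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])

rng = np.random.default_rng(0)
p_gd = np.zeros(2)    # plain GD: gradient is exactly zero, so it stays put
p_sgd = np.zeros(2)   # noisy GD: perturbations get amplified along -y^2
for _ in range(200):
    p_gd -= 0.05 * grad(p_gd)
    p_sgd -= 0.05 * (grad(p_sgd) + rng.normal(scale=0.01, size=2))
```

After 200 steps the noiseless iterate is still at the origin, while the noisy one has moved far away along the unstable $y$ direction.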

Learning Rate Schedules#

| Schedule | Formula | Use |
|---|---|---|
| Step decay | $\alpha \cdot \gamma^{\lfloor t/k \rfloor}$ | simple, piecewise |
| Cosine annealing | $\alpha_\min + \tfrac{1}{2}(\alpha_\max - \alpha_\min)(1 + \cos(\pi t / T))$ | smooth, widely used |
| Warmup | linearly increase, then decay | Transformer training |
| 1-cycle | increase then decrease | fast training |
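The two closed-form schedules from the table can be written directly (function and parameter names are illustrative):

```python
import math

# Two learning-rate schedules; parameter names are illustrative.
def step_decay(alpha0, gamma, k, t):
    """alpha0 * gamma^floor(t / k): drop the rate by a factor gamma every k steps."""
    return alpha0 * gamma ** (t // k)

def cosine_annealing(alpha_min, alpha_max, t, T):
    """Smoothly decay from alpha_max at t = 0 to alpha_min at t = T."""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))
```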