Backpropagation#
Efficient algorithm for computing $\partial L / \partial \theta$ for all parameters $\theta$ using the chain rule.
Chain Rule#
For composed functions $f(g(x))$:
$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
For a computational graph, gradients flow backwards — from loss to inputs.
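A quick numeric sanity check of the chain rule, using an illustrative composition $f(g) = \sin(g)$, $g(x) = x^2$ (these function choices are assumptions, not from the notes):

```python
import math

# Illustrative composition: f(g) = sin(g), g(x) = x^2
def g(x): return x * x
def dg(x): return 2 * x          # dg/dx
def f(u): return math.sin(u)
def df(u): return math.cos(u)    # df/dg

x = 1.5
# Chain rule: df/dx = df/dg * dg/dx
analytic = df(g(x)) * dg(x)

# Finite-difference approximation of the same derivative
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(analytic, numeric)  # the two values agree to ~6 decimal places
```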
Forward Pass#
Compute and store intermediate values:
z₁ = W₁x + b₁
h₁ = σ(z₁)
z₂ = W₂h₁ + b₂
ŷ = z₂
L = loss(ŷ, y)
Backward Pass#
Propagate $\partial L / \partial (\cdot)$ from output to input:
∂L/∂z₂ = ∂L/∂ŷ # from loss
∂L/∂W₂ = (∂L/∂z₂) h₁ᵀ # weight gradient
∂L/∂b₂ = ∂L/∂z₂ # bias gradient
∂L/∂h₁ = W₂ᵀ (∂L/∂z₂) # propagate to prev layer
∂L/∂z₁ = (∂L/∂h₁) ⊙ σ'(z₁) # through activation
∂L/∂W₁ = (∂L/∂z₁) xᵀ
∂L/∂b₁ = ∂L/∂z₁
Computational Complexity#
Forward: $O(\text{parameters})$. Backward: $O(\text{parameters})$, the same order as the forward pass (roughly $2\times$ the forward cost in practice).
Memory: must store all intermediate activations for the backward pass.
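The two passes above can be sketched in NumPy for a one-hidden-layer network with sigmoid activation; the shapes and the squared-error loss are illustrative assumptions, and a finite-difference check confirms one of the computed gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                       # input
y = rng.normal(size=(2, 1))                       # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Forward pass: compute and store intermediates
z1 = W1 @ x + b1
h1 = sigmoid(z1)
z2 = W2 @ h1 + b2
y_hat = z2
L = 0.5 * np.sum((y_hat - y) ** 2)                # squared-error loss (assumed)

# Backward pass: propagate dL/d(.) from output to input
dz2 = y_hat - y                                   # dL/dz2 = dL/dŷ for this loss
dW2 = dz2 @ h1.T
db2 = dz2
dh1 = W2.T @ dz2
dz1 = dh1 * sigmoid(z1) * (1 - sigmoid(z1))       # σ'(z) = σ(z)(1 − σ(z))
dW1 = dz1 @ x.T
db1 = dz1

# Finite-difference check on a single weight
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp = 0.5 * np.sum((W2 @ sigmoid(W1p @ x + b1) + b2 - y) ** 2)
assert abs((Lp - L) / eps - dW1[0, 0]) < 1e-4
```

Note how every quantity stored in the forward pass (`z1`, `h1`, `x`) is reused in the backward pass, which is exactly why the activations must be kept in memory.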
Gradient Tape (PyTorch Autograd)#
Modern frameworks build a dynamic computational graph and replay it backwards:
x = torch.tensor([1.0], requires_grad=True)
y = x ** 2 + 3 * x
y.backward()
x.grad # dy/dx = 2x + 3 = 5
Vanishing / Exploding Gradients#
- Vanishing: gradients shrink exponentially with depth (sigmoid saturation, deep RNNs)
- Fix: ReLU activations, residual connections, LSTM gates, batch norm
- Exploding: gradients grow exponentially
- Fix: gradient clipping (clip by norm or value)
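A one-line illustration of why sigmoid saturation vanishes gradients: σ'(z) peaks at 0.25, so a chain of such factors through many layers shrinks exponentially (depth of 20 is an arbitrary illustrative choice):

```python
# σ'(z) = σ(z)(1 − σ(z)) attains its maximum 0.25 at z = 0,
# so across 20 sigmoid layers the gradient factor is at most 0.25**20.
sigmoid_grad_max = 0.25
depth = 20
upper_bound = sigmoid_grad_max ** depth
print(upper_bound)  # ~9.1e-13: gradients effectively vanish
```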
Gradient Clipping#
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Common in RNN and Transformer training.
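A sketch of where clipping sits in a training step: after `backward()` populates the gradients and before `optimizer.step()` consumes them. The toy linear model, random data, and SGD optimizer are placeholder assumptions:

```python
import torch

# Placeholder model/data standing in for a real RNN or Transformer setup
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                  # populate .grad on each parameter
torch.nn.utils.clip_grad_norm_(  # rescale so the total grad norm is <= 1.0
    model.parameters(), max_norm=1.0
)
optimizer.step()                 # update using the clipped gradients
```

`clip_grad_norm_` rescales all gradients jointly so their combined norm does not exceed `max_norm`; gradients already below the threshold are left untouched.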