Backpropagation#

Efficient algorithm for computing $\partial L / \partial \theta$ for all parameters $\theta$ using the chain rule.

Chain Rule#

For composed functions $f(g(x))$:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

For a computational graph, gradients flow backwards — from loss to inputs.
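The chain rule can be checked numerically on a small composition. A minimal sketch (the functions $g(x) = x^2 + 1$ and $f(u) = u^3$ are illustrative choices, not from the text):

```python
# Chain rule on f(g(x)) with g(x) = x**2 + 1 and f(u) = u**3 (illustrative).
def g(x):
    return x ** 2 + 1

def f(u):
    return u ** 3

def df_dx(x):
    # df/dg = 3 g(x)**2,  dg/dx = 2x  ->  multiply them
    return 3 * g(x) ** 2 * 2 * x

# Central finite-difference check at x = 2: analytically 3 * 25 * 4 = 300.
x, eps = 2.0, 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
```

The analytic chain-rule value and the finite-difference estimate agree to several decimal places, which is also how backprop implementations are typically sanity-checked.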

Forward Pass#

Compute and store intermediate values:

z₁ = W₁x + b₁
h₁ = σ(z₁)
z₂ = W₂h₁ + b₂
ŷ  = z₂
L  = loss(ŷ, y)
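The forward pass above can be sketched in NumPy. The dimensions (3 → 4 → 1), the squared-error loss, and the random initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input dim 3, hidden dim 4, scalar output.
x  = rng.normal(size=(3, 1))
y  = np.array([[1.0]])
W1 = rng.normal(size=(4, 3)); b1 = np.zeros((4, 1))
W2 = rng.normal(size=(1, 4)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: every intermediate is kept for the backward pass.
z1 = W1 @ x + b1
h1 = sigmoid(z1)
z2 = W2 @ h1 + b2
y_hat = z2
L = 0.5 * np.sum((y_hat - y) ** 2)   # squared-error loss (an assumed choice)
```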

Backward Pass#

Propagate $\partial L / \partial (\cdot)$ from output to input:

∂L/∂z₂ = ∂L/∂ŷ                      # from loss
∂L/∂W₂ = (∂L/∂z₂) h₁ᵀ               # weight gradient
∂L/∂b₂ = ∂L/∂z₂                      # bias gradient
∂L/∂h₁ = W₂ᵀ (∂L/∂z₂)               # propagate to prev layer
∂L/∂z₁ = (∂L/∂h₁) ⊙ σ'(z₁)          # through activation
∂L/∂W₁ = (∂L/∂z₁) xᵀ
∂L/∂b₁ = ∂L/∂z₁
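The backward equations above translate line for line into NumPy, and can be verified against a finite difference. The setup (dims 3 → 4 → 1, squared-error loss, sigmoid activation) mirrors the forward-pass example and is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: dims 3 -> 4 -> 1, squared-error loss.
x  = rng.normal(size=(3, 1))
y  = np.array([[1.0]])
W1 = rng.normal(size=(4, 3)); b1 = np.zeros((4, 1))
W2 = rng.normal(size=(1, 4)); b2 = np.zeros((1, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h1 = sigmoid(z1)
    z2 = W2 @ h1 + b2
    return z1, h1, z2, 0.5 * np.sum((z2 - y) ** 2)

z1, h1, z2, L = forward(W1, b1, W2, b2)

# Backward pass, mirroring the equations above.
dz2 = z2 - y                   # dL/dy_hat for squared error, and y_hat = z2
dW2 = dz2 @ h1.T
db2 = dz2
dh1 = W2.T @ dz2
dz1 = dh1 * (h1 * (1 - h1))    # sigma'(z1) = sigma(z1) * (1 - sigma(z1))
dW1 = dz1 @ x.T
db1 = dz1

# Finite-difference check on one entry of W1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (forward(W1p, b1, W2, b2)[3] - forward(W1m, b1, W2, b2)[3]) / (2 * eps)
```

Note the gradient shapes match the parameter shapes (`dW2` is 1×4 like `W2`, `dW1` is 4×3 like `W1`), which is a quick sanity check on the transposes.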

Computational Complexity#

Forward pass: $O(\text{parameters})$. Backward pass: also $O(\text{parameters})$, the same order as the forward pass and roughly $2\times$ its cost in practice.

Memory: must store all intermediate activations for the backward pass.
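A back-of-the-envelope estimate makes the activation-memory cost concrete. All sizes here are made-up illustrative numbers, not from the text:

```python
# Activation memory that must be held for the backward pass:
# one stored activation per layer, batch x width values each.
# Illustrative sizes: batch 64, hidden width 1024, 24 layers, float32.
batch, width, layers, bytes_per = 64, 1024, 24, 4
activation_bytes = batch * width * layers * bytes_per
mib = activation_bytes / 2**20   # = 6.0 MiB for these sizes
```

This scales linearly with batch size and depth, which is why techniques like gradient checkpointing trade recomputation for memory.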

Gradient Tape (PyTorch Autograd)#

Modern frameworks build a dynamic computational graph and replay it backwards:

import torch

x = torch.tensor([1.0], requires_grad=True)
y = x ** 2 + 3 * x
y.backward()
print(x.grad)  # dy/dx = 2x + 3 = 5 at x = 1

Vanishing / Exploding Gradients#

  • Vanishing: gradients shrink exponentially with depth (sigmoid saturation, deep RNNs)
    • Fix: ReLU activations, residual connections, LSTM gates, batch norm
  • Exploding: gradients grow exponentially
    • Fix: gradient clipping (clip by norm or value)
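The vanishing case is easy to see directly: $\sigma'(z) \le 0.25$ everywhere, so the product of sigmoid derivatives that the chain rule accumulates across depth shrinks at least as fast as $0.25^{\text{depth}}$. A small sketch (weights ignored and the $z$ values randomly chosen, for simplicity):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Product of sigma'(z_i) factors multiplied together across depth.
rng = np.random.default_rng(0)
z = rng.normal(size=50)                         # 50 "layers", illustrative
grad_factor = np.prod(sigmoid(z) * (1 - sigmoid(z)))
# Each factor is strictly below 0.25, so the product is below 0.25**50.
```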

Gradient Clipping#

# Call after loss.backward() and before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Common in RNN and Transformer training.
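The idea behind clip-by-norm can be sketched in plain NumPy: compute the global L2 norm over all gradient arrays, and rescale everything if it exceeds `max_norm`. This is a simplified sketch of the technique, not PyTorch's implementation:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Global L2 norm across all gradient arrays together.
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # direction preserved, magnitude capped
    return grads, total_norm

# Illustrative gradients: global norm = sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```

Clipping by norm preserves the gradient direction and only caps its magnitude, which is why it is usually preferred over clipping each value independently.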