Residual Networks#
The Residual Connection#
He et al., 2015. Instead of learning the target mapping $H(x)$ directly, learn the residual $F(x) = H(x) - x$:
$$y = F(x, \{W_i\}) + x$$
If the identity is optimal, $F \to 0$ is easier to learn than $F \to x$.
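A minimal sketch of the residual formulation, with a toy two-layer MLP standing in for $F$ (the weight names `W1`, `W2` are illustrative, not from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = F(x; W1, W2) + x, with F a toy two-layer MLP.

    Illustrative only: W1/W2 are hypothetical weight matrices.
    """
    F = relu(x @ W1) @ W2   # residual branch F(x, {W_i})
    return F + x            # identity shortcut

# With zero weights, F(x) = 0 and the block is exactly the identity --
# the "easy" solution the residual parameterization makes available.
x = np.ones(4)
W1 = np.zeros((4, 8))
W2 = np.zeros((8, 4))
print(np.allclose(residual_block(x, W1, W2), x))  # True
```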
Why Residuals Help#
- Gradient flow: gradients bypass layers directly via skip connection — mitigates vanishing gradients
- Identity shortcut: deep network can learn identity for unnecessary layers
- Implicit ensembles: residual networks behave like ensembles of shallower networks (Veit et al., 2016)
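The gradient-flow point can be made concrete with a scalar toy: for $y = F(x) + x$, $\partial y / \partial x = F'(x) + 1$, so the backward signal stays near 1 even when the branch's own derivative is tiny. A quick finite-difference check (the weight `w` here is an arbitrary small value chosen for illustration):

```python
# Scalar toy: F(x) = w * x with a tiny weight w.
# A plain layer passes back only w; the residual layer passes back w + 1.
w = 1e-3

def plain(x):
    return w * x

def residual(x):
    return w * x + x

x, eps = 2.0, 1e-6
grad_plain = (plain(x + eps) - plain(x - eps)) / (2 * eps)
grad_res = (residual(x + eps) - residual(x - eps)) / (2 * eps)
print(grad_plain, grad_res)  # ~0.001 vs ~1.001
```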
ResNet Block#
x → Conv → BN → ReLU → Conv → BN → (+x) → ReLU

Bottleneck block (for deeper networks): 1×1 → 3×3 → 1×1 convolutions; reduces parameters and compute.
When input/output dims differ: use 1×1 conv on shortcut to match dimensions.
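A sketch of the projection shortcut, treating a 1×1 convolution as a per-pixel linear map over channels (the function and weight names are made up for illustration; the main branch is collapsed to a single 1×1 conv + ReLU for brevity):

```python
import numpy as np

def conv1x1(x, W):
    """1x1 convolution = independent linear map at each spatial position.
    x: (H, W, C_in), W: (C_in, C_out)."""
    return x @ W

def block_with_projection(x, W_main, W_proj):
    """Residual block whose input/output channel counts differ, so the
    shortcut applies a 1x1 conv (W_proj) to match dimensions."""
    F = np.maximum(conv1x1(x, W_main), 0.0)  # stand-in for the main branch
    return F + conv1x1(x, W_proj)            # projected shortcut

x = np.random.randn(8, 8, 16)                # 16 input channels
W_main = np.random.randn(16, 32) * 0.01      # widen to 32 channels
W_proj = np.random.randn(16, 32) * 0.01
y = block_with_projection(x, W_main, W_proj)
print(y.shape)  # (8, 8, 32)
```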
Architectures#
| Model | Layers | Params | Top-1 (ImageNet) |
|---|---|---|---|
| ResNet-18 | 18 | 11M | 69.8% |
| ResNet-50 | 50 | 25M | 76.1% |
| ResNet-101 | 101 | 45M | 77.4% |
| ResNet-152 | 152 | 60M | 78.3% |
Pre-Activation ResNet (v2)#
BN → ReLU → Conv → BN → ReLU → Conv → (+x)
Keeps the identity path clean (no ReLU after the addition), which improves gradient flow and gives slightly better accuracy, especially at extreme depth.
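The two orderings can be contrasted side by side. This sketch uses crude placeholder `bn`/`conv` functions (no learned BN parameters, dense instead of convolutional layers) purely to show where the nonlinearities sit relative to the addition:

```python
import numpy as np

# Placeholder layers -- real BN and conv are assumed, these are toy stand-ins.
def bn(x):
    return (x - x.mean()) / (x.std() + 1e-5)

def conv(x, W):
    return x @ W

def relu(x):
    return np.maximum(x, 0.0)

def block_v1(x, W1, W2):
    """Original (post-activation): Conv-BN-ReLU-Conv-BN, add, then ReLU.
    The final ReLU sits on the identity path."""
    return relu(bn(conv(relu(bn(conv(x, W1))), W2)) + x)

def block_v2(x, W1, W2):
    """Pre-activation (v2): BN-ReLU-Conv, BN-ReLU-Conv, then add.
    The addition is the last op, so the identity path stays untouched."""
    return conv(relu(bn(conv(relu(bn(x)), W1))), W2) + x

x = np.random.randn(4, 8)
W1 = np.random.randn(8, 8) * 0.1
W2 = np.random.randn(8, 8) * 0.1
print(block_v1(x, W1, W2).shape, block_v2(x, W1, W2).shape)  # (4, 8) (4, 8)
```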
Beyond ResNet#
- DenseNet: each layer receives feature maps from all preceding layers
- SENet: squeeze-and-excitation — channel-wise attention on features
- ConvNeXt: ResNet modernized with Transformer design choices (large kernels, fewer activations, etc.)
In Transformers#
The same residual principle appears in every Transformer block:
x = x + Attention(LN(x))
x = x + FFN(LN(x))
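The two lines above can be sketched as a function, with the attention and FFN sub-layers passed in as callables (a minimal LN without learned scale/shift is assumed):

```python
import numpy as np

def layer_norm(x):
    """Minimal LN over the last axis; real implementations add a learned
    scale and shift."""
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / (x.std(-1, keepdims=True) + 1e-5)

def transformer_block(x, attn, ffn):
    """Pre-norm Transformer block: two residual sub-layers, exactly the
    pattern above. attn/ffn are caller-supplied functions."""
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

# With zeroed-out sub-layers the block is the identity -- the same
# "F -> 0 recovers identity" property as the ResNet block.
x = np.random.randn(3, 5)
zero = lambda h: np.zeros_like(h)
print(np.allclose(transformer_block(x, zero, zero), x))  # True
```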