Residual Networks#

The Residual Connection#

He et al., 2015. Instead of learning the target mapping $H(x)$ directly, let a stack of layers learn the residual $F(x) = H(x) - x$:

$$y = F(x, \{W_i\}) + x$$

If the identity mapping is optimal, driving $F \to 0$ is easier than fitting $H \to x$ with a stack of nonlinear layers.

Why Residuals Help#

  • Gradient flow: gradients bypass layers directly via skip connection — mitigates vanishing gradients
  • Identity shortcut: deep network can learn identity for unnecessary layers
  • Implicit ensembles: residual networks behave like ensembles of shallower networks (Veit et al., 2016)
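The gradient-flow point can be checked numerically: in a deep stack of small nonlinear layers, the gradient reaching the input collapses without skips but stays healthy with them. A minimal PyTorch sketch (the depth, width, and layer choice are illustrative, not from any paper):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 32

# One shared stack of small nonlinear layers, run with and without skips.
layers = nn.ModuleList(
    nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)
)

def grad_norm_at_input(use_skips: bool) -> float:
    x = torch.randn(8, dim, requires_grad=True)
    h = x
    for f in layers:
        h = h + f(h) if use_skips else f(h)  # residual vs. plain composition
    h.sum().backward()
    return x.grad.norm().item()

plain = grad_norm_at_input(False)  # gradient shrinks layer by layer
skip = grad_norm_at_input(True)    # identity path keeps it large
```

With skips, each layer's Jacobian is $I + \partial F/\partial x$, so the backward product always contains an identity path; without them, the product of 50 contractive Jacobians is vanishingly small.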

ResNet Block#

x → Conv → BN → ReLU → Conv → BN → (+x) → ReLU

Bottleneck block (for deeper networks): 1×1 → 3×3 → 1×1, reduces parameters.
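To make the parameter saving concrete, a hedged PyTorch sketch of a bottleneck block (the names `Bottleneck`, `ch`, `mid` are mine, not from the paper); at width 256 with a 64-channel bottleneck it uses far fewer weights than two 3×3 convs at full width:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity shortcut."""
    def __init__(self, ch: int, mid: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)

# Parameter comparison at width 256 (no-bias convs, BN params are negligible).
bneck_params = sum(p.numel() for p in Bottleneck(256, 64).parameters())
basic_params = 2 * 256 * 256 * 9  # two 3x3 convs at full width
```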

When input and output dimensions differ (stride > 1 or a channel change): use a 1×1 conv on the shortcut to match them.
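Putting the block diagram and the projection shortcut together, a minimal PyTorch sketch of the post-activation (v1) block; class and argument names are mine:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Conv-BN-ReLU-Conv-BN, then add the shortcut and apply the final ReLU."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Projection shortcut (1x1 conv) when spatial size or channels change.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))  # add skip, then final ReLU
```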

Architectures#

| Model      | Layers | Params | Top-1 (ImageNet) |
|------------|--------|--------|------------------|
| ResNet-18  | 18     | 11M    | 69.8%            |
| ResNet-50  | 50     | 25M    | 76.1%            |
| ResNet-101 | 101    | 45M    | 77.4%            |
| ResNet-152 | 152    | 60M    | 78.3%            |

Pre-Activation ResNet (v2)#

BN → ReLU → Conv → BN → ReLU → Conv → (+x)

The identity path stays completely clean (no ReLU after the addition), which improves gradient flow and gives slightly better accuracy, especially for very deep networks.
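A sketch of the pre-activation variant in the same hedged PyTorch style (names are illustrative); note that the shortcut now carries $x$ untouched, with nothing applied after the addition:

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """v2 block: BN-ReLU-Conv twice, then add x with no post-activation."""
    def __init__(self, ch: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x  # identity path is never transformed
```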

Beyond ResNet#

  • DenseNet: each layer receives feature maps from all preceding layers
  • SENet: squeeze-and-excitation — channel-wise attention on features
  • ConvNeXt: ResNet modernized with Transformer design choices (large kernels, fewer activations, etc.)

In Transformers#

The same residual principle appears in every Transformer block:

x = x + Attention(LN(x))
x = x + FFN(LN(x))
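A hedged PyTorch sketch of such a pre-LN block (dimensions, head count, and class names are illustrative assumptions); the two additions play exactly the role of the ResNet shortcut:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LN Transformer block: each sublayer computes a residual F(LN(x))."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # x = x + Attention(LN(x))
        x = x + self.ffn(self.ln2(x))                      # x = x + FFN(LN(x))
        return x
```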