Residual Networks#
The Residual Connection#
He et al., 2015. Instead of learning the target mapping $H(x)$ directly, learn the residual $F(x) = H(x) - x$:
$$y = F(x, \{W_i\}) + x$$
If the identity is optimal, $F \to 0$ is easier to learn than $F \to x$.
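A minimal sketch of the residual formulation, with a toy two-layer MLP standing in for $F$ (the weight names `W1`, `W2` are illustrative, not from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = F(x; W1, W2) + x, with F a toy two-layer MLP.

    Illustrative only: W1/W2 are hypothetical weight matrices.
    """
    F = relu(x @ W1) @ W2   # residual branch F(x, {W_i})
    return F + x            # identity shortcut

# With zero weights, F(x) = 0 and the block is exactly the identity --
# the "easy" solution the residual parameterization makes available.
x = np.ones(4)
W1 = np.zeros((4, 8))
W2 = np.zeros((8, 4))
print(np.allclose(residual_block(x, W1, W2), x))  # True
```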
Why Residuals Help#
- Gradient flow: gradients bypass layers directly via skip connection — mitigates vanishing gradients
- Identity shortcut: deep network can learn identity for unnecessary layers
- Implicit ensembles: residual networks behave like ensembles of shallower networks (Veit et al., 2016)
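The gradient-flow point can be made concrete with a scalar toy: for $y = F(x) + x$, $\partial y / \partial x = F'(x) + 1$, so the backward signal stays near 1 even when the branch's own derivative is tiny. A quick finite-difference check (the weight `w` here is an arbitrary small value chosen for illustration):

```python
# Scalar toy: F(x) = w * x with a tiny weight w.
# A plain layer passes back only w; the residual layer passes back w + 1.
w = 1e-3

def plain(x):
    return w * x

def residual(x):
    return w * x + x

x, eps = 2.0, 1e-6
grad_plain = (plain(x + eps) - plain(x - eps)) / (2 * eps)
grad_res = (residual(x + eps) - residual(x - eps)) / (2 * eps)
print(grad_plain, grad_res)  # ~0.001 vs ~1.001
```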
ResNet Block#
x → Conv → BN → ReLU → Conv → BN → (+x) → ReLU

Bottleneck block (for deeper networks): 1×1 → 3×3 → 1×1 convolutions; reduces parameters and compute.
When input/output dims differ: use 1×1 conv on shortcut to match dimensions.
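A sketch of the projection shortcut, treating a 1×1 convolution as a per-pixel linear map over channels (the function and weight names are made up for illustration; the main branch is collapsed to a single 1×1 conv + ReLU for brevity):

```python
import numpy as np

def conv1x1(x, W):
    """1x1 convolution = independent linear map at each spatial position.
    x: (H, W, C_in), W: (C_in, C_out)."""
    return x @ W

def block_with_projection(x, W_main, W_proj):
    """Residual block whose input/output channel counts differ, so the
    shortcut applies a 1x1 conv (W_proj) to match dimensions."""
    F = np.maximum(conv1x1(x, W_main), 0.0)  # stand-in for the main branch
    return F + conv1x1(x, W_proj)            # projected shortcut

x = np.random.randn(8, 8, 16)                # 16 input channels
W_main = np.random.randn(16, 32) * 0.01      # widen to 32 channels
W_proj = np.random.randn(16, 32) * 0.01
y = block_with_projection(x, W_main, W_proj)
print(y.shape)  # (8, 8, 32)
```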
Architectures#
| Model | Layers | Params | Top-1 (ImageNet) |
|---|---|---|---|
| ResNet-18 | 18 | 11M | 69.8% |
| ResNet-50 | 50 | 25M | 76.1% |
| ResNet-101 | 101 | 45M | 77.4% |
| ResNet-152 | 152 | 60M | 78.3% |
Pre-Activation ResNet (v2)#
BN → ReLU → Conv → BN → ReLU → Conv → (+x)
Keeps the identity path clean (no ReLU after the addition), which improves gradient flow and gives slightly better accuracy, especially at extreme depth.
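The two orderings can be contrasted side by side. This sketch uses crude placeholder `bn`/`conv` functions (no learned BN parameters, dense instead of convolutional layers) purely to show where the nonlinearities sit relative to the addition:

```python
import numpy as np

# Placeholder layers -- real BN and conv are assumed, these are toy stand-ins.
def bn(x):
    return (x - x.mean()) / (x.std() + 1e-5)

def conv(x, W):
    return x @ W

def relu(x):
    return np.maximum(x, 0.0)

def block_v1(x, W1, W2):
    """Original (post-activation): Conv-BN-ReLU-Conv-BN, add, then ReLU.
    The final ReLU sits on the identity path."""
    return relu(bn(conv(relu(bn(conv(x, W1))), W2)) + x)

def block_v2(x, W1, W2):
    """Pre-activation (v2): BN-ReLU-Conv, BN-ReLU-Conv, then add.
    The addition is the last op, so the identity path stays untouched."""
    return conv(relu(bn(conv(relu(bn(x)), W1))), W2) + x

x = np.random.randn(4, 8)
W1 = np.random.randn(8, 8) * 0.1
W2 = np.random.randn(8, 8) * 0.1
print(block_v1(x, W1, W2).shape, block_v2(x, W1, W2).shape)  # (4, 8) (4, 8)
```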
Beyond ResNet#
- DenseNet: each layer receives feature maps from all preceding layers
- SENet: squeeze-and-excitation — channel-wise attention on features
- ConvNeXt: ResNet modernized with Transformer design choices (large kernels, fewer activations, etc.)
In Transformers#
The same residual principle appears in every Transformer block:
x = x + Attention(LN(x))
x = x + FFN(LN(x))
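The two lines above can be sketched as a function, with the attention and FFN sub-layers passed in as callables (a minimal LN without learned scale/shift is assumed):

```python
import numpy as np

def layer_norm(x):
    """Minimal LN over the last axis; real implementations add a learned
    scale and shift."""
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / (x.std(-1, keepdims=True) + 1e-5)

def transformer_block(x, attn, ffn):
    """Pre-norm Transformer block: two residual sub-layers, exactly the
    pattern above. attn/ffn are caller-supplied functions."""
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

# With zeroed-out sub-layers the block is the identity -- the same
# "F -> 0 recovers identity" property as the ResNet block.
x = np.random.randn(3, 5)
zero = lambda h: np.zeros_like(h)
print(np.allclose(transformer_block(x, zero, zero), x))  # True
```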