# Convolutional Neural Networks

## Convolution Operation
For input $X$ ($H \times W \times C$) and filter $K$ ($k \times k \times C \times F$):
$$\text{output}(i, j, f) = \sum_c \sum_p \sum_q K(p, q, c, f) \cdot X(si+p, sj+q, c) + b(f)$$
- Stride $s$: step size between filter applications
- Padding $p$: zeros added around input to control output size
Output size (per spatial dimension): $\lfloor (H + 2p - k)/s \rfloor + 1$
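As a sanity check on the formula and the output-size rule, here is a naive NumPy sketch of the operation (explicit loops, no im2col; the function name and shapes are illustrative, not from any library):

```python
import numpy as np

def conv2d(X, K, b, stride=1, pad=0):
    """Naive convolution as in the formula above.
    X: (H, W, C) input, K: (k, k, C, F) filters, b: (F,) biases."""
    H, W, C = X.shape
    k, _, _, F = K.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    H_out = (H + 2 * pad - k) // stride + 1  # the output-size rule
    W_out = (W + 2 * pad - k) // stride + 1
    out = np.zeros((H_out, W_out, F))
    for i in range(H_out):
        for j in range(W_out):
            patch = Xp[i * stride:i * stride + k, j * stride:j * stride + k, :]
            for f in range(F):
                out[i, j, f] = np.sum(patch * K[:, :, :, f]) + b[f]
    return out

X = np.random.randn(8, 8, 3)
K = np.random.randn(3, 3, 3, 4)
b = np.zeros(4)
out = conv2d(X, K, b, stride=1, pad=1)
print(out.shape)  # (8, 8, 4): k=3, p=1, s=1 preserves spatial size
```

With `stride=2` the same input yields a 4×4 map, matching $\lfloor(8 + 2 - 3)/2\rfloor + 1 = 4$.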
## Why Convolutions Work for Images
- Local connectivity: each output depends on a small receptive field
- Weight sharing: same filter across all positions — translation equivariance
- Hierarchical features: edges → textures → parts → objects
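Translation equivariance from weight sharing can be checked numerically: shifting the input and then convolving matches convolving and then shifting, away from the borders. A small NumPy sketch (single channel, stride 1, all names illustrative):

```python
import numpy as np

def valid_conv(X, K):
    # single-channel 'valid' cross-correlation, stride 1
    H, W = X.shape
    k = K.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + k, j:j + k] * K)
    return out

X = np.random.randn(10, 10)
K = np.random.randn(3, 3)
shift = lambda A: np.roll(A, 1, axis=0)  # circular shift down by one row

a = valid_conv(shift(X), K)  # shift, then convolve
b = valid_conv(X, K)         # convolve, then shift
print(np.allclose(a[1:-1], shift(b)[1:-1]))  # True away from the wrap-around rows
```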
## Pooling
Reduces spatial dimensions:
- Max pooling: take max in each window — most common
- Average pooling: take mean — used before classification head
- Global average pooling: collapse entire feature map to $1 \times 1$
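Both variants are a few lines in NumPy (a sketch assuming even spatial dimensions and stride equal to the window size):

```python
import numpy as np

def max_pool2x2(X):
    """2x2 max pooling, stride 2, on an (H, W, C) feature map (H, W even)."""
    H, W, C = X.shape
    return X.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def global_avg_pool(X):
    """Collapse (H, W, C) to a length-C vector."""
    return X.mean(axis=(0, 1))

X = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool2x2(X)[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
print(global_avg_pool(X))  # [7.5]
```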
## Standard Block

Conv(3×3, stride 1, padding 1) → BatchNorm → ReLU → (repeat) → MaxPool(2×2)

## Parameter Count
A conv layer with $k \times k$ kernel, $C_\text{in}$ channels, $C_\text{out}$ filters:
$$\text{Params} = k^2 \cdot C_\text{in} \cdot C_\text{out} + C_\text{out}$$
vs. an FC layer mapping the same $H \times W \times C_\text{in}$ input to $C_\text{out}$ units: $H \cdot W \cdot C_\text{in} \cdot C_\text{out}$ weights, orders of magnitude more for typical image sizes.
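A quick arithmetic check of the formula (the 3×3, 64→128 layer is an illustrative example, not from the text):

```python
def conv_params(k, c_in, c_out):
    """Weights plus one bias per filter, per the formula above."""
    return k * k * c_in * c_out + c_out

# a 3x3 conv from 64 to 128 channels:
print(conv_params(3, 64, 128))  # 73856

# vs. a fully connected layer from a 32x32x64 input to 128 units:
print(32 * 32 * 64 * 128)  # 8388608, over a hundred times more
```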
## Classic Architectures
| Model | Year | Innovation |
|---|---|---|
| LeNet-5 | 1998 | first successful CNN |
| AlexNet | 2012 | deep + ReLU + dropout + GPU |
| VGG | 2014 | very deep, 3×3 convs only |
| GoogLeNet | 2014 | Inception modules, 1×1 convs |
| ResNet | 2015 | residual connections |
| DenseNet | 2017 | dense connections |
| EfficientNet | 2019 | compound scaling |
| ConvNeXt | 2022 | modernized ResNet |
## Receptive Field
The region of the input that influences a given output unit. After $L$ layers with kernel size $k$ and stride 1:
$$\text{RF} = 1 + L(k-1)$$
Larger RF needed for global reasoning; achieved via deeper networks, dilated convolutions, or pooling.
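The stride-1 formula generalizes to stacks with mixed kernels, strides, and dilations; a small sketch (the `jump` bookkeeping, which tracks the cumulative stride product, is an assumption beyond the stride-1 formula above):

```python
def receptive_field(layers):
    """Receptive field after a stack of layers given as (kernel, stride, dilation)."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1     # footprint of a dilated kernel
        rf += (k_eff - 1) * jump    # each layer grows RF by (k_eff - 1) input steps
        jump *= s                   # one output step now spans this many input pixels
    return rf

# three 3x3 stride-1 convs: RF = 1 + 3*(3-1) = 7, matching the formula
print(receptive_field([(3, 1, 1)] * 3))  # 7
# the same three layers with dilation 2: larger RF at the same parameter cost
print(receptive_field([(3, 1, 2)] * 3))  # 13
```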