Convolutional Neural Networks#

Convolution Operation#

For input $X$ ($H \times W \times C$) and filter $K$ ($k \times k \times C \times F$):

$$\text{output}(i, j, f) = \sum_{c=0}^{C-1} \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} K(u, v, c, f) \cdot X(s i + u, s j + v, c) + b(f)$$

  • Stride $s$: step size between filter applications
  • Padding $p$: zeros added around input to control output size

Output size (per spatial dimension; shown for height): $\lfloor (H + 2p - k)/s \rfloor + 1$
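The formula and the output-size rule above can be checked with a minimal numpy sketch (the function name and loop structure are illustrative, not a reference implementation):

```python
import numpy as np

def conv2d(X, K, b, stride=1, pad=0):
    """Naive convolution matching the formula above.
    X: (H, W, C) input; K: (k, k, C, F) filters; b: (F,) biases."""
    H, W, C = X.shape
    k, _, _, F = K.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    H_out = (H + 2 * pad - k) // stride + 1   # output-size formula
    W_out = (W + 2 * pad - k) // stride + 1
    out = np.empty((H_out, W_out, F))
    for i in range(H_out):
        for j in range(W_out):
            patch = Xp[i*stride:i*stride+k, j*stride:j*stride+k, :]  # (k, k, C)
            # sum over u, v, c for all F filters at once
            out[i, j, :] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2])) + b
    return out

X = np.random.randn(8, 8, 3)
K = np.random.randn(3, 3, 3, 4)
out = conv2d(X, K, np.zeros(4), stride=1, pad=1)
print(out.shape)  # (8, 8, 4): "same" padding keeps 8 = (8 + 2·1 − 3)/1 + 1
```

With `stride=2` the same call returns a `(4, 4, 4)` map, again matching the formula.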

Why Convolutions Work for Images#

  • Local connectivity: each output depends on a small receptive field
  • Weight sharing: same filter across all positions — translation equivariance
  • Hierarchical features: edges → textures → parts → objects
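Translation equivariance from weight sharing is easy to demonstrate in one dimension (the impulse input and filter values here are arbitrary):

```python
import numpy as np

# Weight sharing makes (valid) convolution translation-equivariant:
# shifting the input shifts the output by the same amount.
x = np.zeros(16)
x[5] = 1.0                            # impulse at position 5
k = np.array([1.0, 2.0, 3.0])         # the same filter applied everywhere
y = np.convolve(x, k, mode='valid')

x_shift = np.roll(x, 2)               # move the impulse 2 steps right
y_shift = np.convolve(x_shift, k, mode='valid')

print(np.allclose(np.roll(y, 2), y_shift))  # True (away from the borders)
```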

Pooling#

Reduces spatial dimensions:

  • Max pooling: take max in each window — most common
  • Average pooling: take the mean in each window
  • Global average pooling: collapse each feature map to $1 \times 1$ — common before the classification head
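The three variants above can be sketched in numpy (the reshape trick assumes even spatial dimensions; function names are mine):

```python
import numpy as np

def max_pool2x2(X):
    """2×2 max pooling, stride 2, on an (H, W, C) map with even H, W."""
    H, W, C = X.shape
    return X.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def avg_pool2x2(X):
    """2×2 average pooling, stride 2."""
    H, W, C = X.shape
    return X.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def global_avg_pool(X):
    """Collapse the spatial dimensions to one value per channel."""
    return X.mean(axis=(0, 1))  # shape (C,)

X = np.arange(16.0).reshape(4, 4, 1)
print(max_pool2x2(X)[..., 0])
# [[ 5.  7.]
#  [13. 15.]]
print(global_avg_pool(X))  # [7.5]
```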

Standard Block#

Conv(3×3, stride 1, padding 1) → BatchNorm → ReLU → (repeat) → MaxPool(2×2)
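One way to write this block — the framework choice (PyTorch) and the two-conv repeat count are assumptions; the pattern itself is the standard conv → BN → ReLU stack:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Conv(3×3, stride 1, padding 1) → BatchNorm → ReLU, twice, then MaxPool(2×2)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),  # halves H and W
    )

x = torch.randn(1, 3, 32, 32)
print(conv_block(3, 64)(x).shape)  # torch.Size([1, 64, 16, 16])
```

Padding 1 with a 3×3 kernel keeps the spatial size fixed, so only the pool changes resolution.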

Parameter Count#

A conv layer with $k \times k$ kernel, $C_\text{in}$ channels, $C_\text{out}$ filters:

$$\text{Params} = k^2 \cdot C_\text{in} \cdot C_\text{out} + C_\text{out}$$

vs. an FC layer mapping the same input to $C_\text{out}$ units: $H \cdot W \cdot C_\text{in} \cdot C_\text{out}$ weights — much larger, since every output unit sees the entire input.
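Plugging in concrete numbers (the 3×3, 64→128 layer and the 32×32 input are example values I chose) makes the gap obvious:

```python
def conv_params(k, c_in, c_out):
    # weights (k·k·C_in per filter, C_out filters) + one bias per filter
    return k * k * c_in * c_out + c_out

def fc_params(h, w, c_in, c_out):
    # fully connected: every output unit is wired to the whole H×W×C_in input
    return h * w * c_in * c_out + c_out

print(conv_params(3, 64, 128))     # 73856
print(fc_params(32, 32, 64, 128))  # 8388736 — >100× more parameters
```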

Classic Architectures#

Model         Year  Innovation
LeNet-5       1998  first successful CNN
AlexNet       2012  deep + ReLU + dropout + GPU
VGG           2014  very deep, 3×3 convs only
GoogLeNet     2014  Inception modules, 1×1 convs
ResNet        2015  residual connections
DenseNet      2017  dense connections
EfficientNet  2019  compound scaling
ConvNeXt      2022  modernized ResNet

Receptive Field#

The region of the input that influences a given output unit. After $L$ layers with kernel size $k$ and stride 1:

$$\text{RF} = 1 + L(k-1)$$

Larger RF needed for global reasoning; achieved via deeper networks, dilated convolutions, or pooling.
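The formula generalizes to mixed kernels and strides via the standard recurrence (accumulate $(k-1)$ scaled by the product of earlier strides); a small sketch, with layer configurations chosen for illustration:

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride) layers.
    rf grows by (k − 1) · jump per layer; jump is the product of prior strides."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Four 3×3 stride-1 layers: matches RF = 1 + L(k − 1) = 1 + 4·2
print(receptive_field([(3, 1)] * 4))  # 9

# Interleaving 2×2 stride-2 pooling grows the RF much faster per layer
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # 18
```

This is why a few pooling (or dilated) layers enlarge the receptive field far more cheaply than stacking stride-1 convolutions alone.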