# Convolutional Neural Networks

## Convolution Operation
For input $X$ ($H \times W \times C$) and filter $K$ ($k \times k \times C \times F$):
$$\text{output}(i, j, f) = \sum_c \sum_p \sum_q K(p, q, c, f) \cdot X(si+p, sj+q, c) + b(f)$$
- Stride $s$: step size between filter applications
- Padding $p$: zeros added around input to control output size
Output size (per spatial dimension): $\lfloor (H + 2p - k)/s \rfloor + 1$
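As a sanity check on the formula and the output-size rule, here is a naive NumPy sketch of the operation (explicit loops, no im2col; the function name and shapes are illustrative, not from any library):

```python
import numpy as np

def conv2d(X, K, b, stride=1, pad=0):
    """Naive convolution as in the formula above.
    X: (H, W, C) input, K: (k, k, C, F) filters, b: (F,) biases."""
    H, W, C = X.shape
    k, _, _, F = K.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    H_out = (H + 2 * pad - k) // stride + 1  # the output-size rule
    W_out = (W + 2 * pad - k) // stride + 1
    out = np.zeros((H_out, W_out, F))
    for i in range(H_out):
        for j in range(W_out):
            patch = Xp[i * stride:i * stride + k, j * stride:j * stride + k, :]
            for f in range(F):
                out[i, j, f] = np.sum(patch * K[:, :, :, f]) + b[f]
    return out

X = np.random.randn(8, 8, 3)
K = np.random.randn(3, 3, 3, 4)
b = np.zeros(4)
out = conv2d(X, K, b, stride=1, pad=1)
print(out.shape)  # (8, 8, 4): k=3, p=1, s=1 preserves spatial size
```

With `stride=2` the same input yields a 4×4 map, matching $\lfloor(8 + 2 - 3)/2\rfloor + 1 = 4$.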
## Why Convolutions Work for Images
- Local connectivity: each output depends on a small receptive field
- Weight sharing: same filter across all positions — translation equivariance
- Hierarchical features: edges → textures → parts → objects
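Translation equivariance from weight sharing can be checked numerically: shifting the input and then convolving matches convolving and then shifting, away from the borders. A small NumPy sketch (single channel, stride 1, all names illustrative):

```python
import numpy as np

def valid_conv(X, K):
    # single-channel 'valid' cross-correlation, stride 1
    H, W = X.shape
    k = K.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + k, j:j + k] * K)
    return out

X = np.random.randn(10, 10)
K = np.random.randn(3, 3)
shift = lambda A: np.roll(A, 1, axis=0)  # circular shift down by one row

a = valid_conv(shift(X), K)  # shift, then convolve
b = valid_conv(X, K)         # convolve, then shift
print(np.allclose(a[1:-1], shift(b)[1:-1]))  # True away from the wrap-around rows
```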
## Pooling
Reduces spatial dimensions:
- Max pooling: take max in each window — most common
- Average pooling: take mean — used before classification head
- Global average pooling: collapse entire feature map to $1 \times 1$
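Both variants are a few lines in NumPy (a sketch assuming even spatial dimensions and stride equal to the window size):

```python
import numpy as np

def max_pool2x2(X):
    """2x2 max pooling, stride 2, on an (H, W, C) feature map (H, W even)."""
    H, W, C = X.shape
    return X.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def global_avg_pool(X):
    """Collapse (H, W, C) to a length-C vector."""
    return X.mean(axis=(0, 1))

X = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool2x2(X)[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
print(global_avg_pool(X))  # [7.5]
```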
## Standard Block

Conv(3×3, stride 1, padding 1) → BatchNorm → ReLU → (repeat) → MaxPool(2×2)

## Parameter Count
A conv layer with $k \times k$ kernel, $C_\text{in}$ channels, $C_\text{out}$ filters:
$$\text{Params} = k^2 \cdot C_\text{in} \cdot C_\text{out} + C_\text{out}$$
vs. an FC layer mapping the same $H \times W \times C_\text{in}$ input to $C_\text{out}$ units: $H \cdot W \cdot C_\text{in} \cdot C_\text{out}$ weights, orders of magnitude more for typical image sizes.
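A quick arithmetic check of the formula (the 3×3, 64→128 layer is an illustrative example, not from the text):

```python
def conv_params(k, c_in, c_out):
    """Weights plus one bias per filter, per the formula above."""
    return k * k * c_in * c_out + c_out

# a 3x3 conv from 64 to 128 channels:
print(conv_params(3, 64, 128))  # 73856

# vs. a fully connected layer from a 32x32x64 input to 128 units:
print(32 * 32 * 64 * 128)  # 8388608, over a hundred times more
```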
## Classic Architectures
| Model | Year | Innovation |
|---|---|---|
| LeNet-5 | 1998 | first successful CNN |
| AlexNet | 2012 | deep + ReLU + dropout + GPU |
| VGG | 2014 | very deep, 3×3 convs only |
| GoogLeNet | 2014 | Inception modules, 1×1 convs |
| ResNet | 2015 | residual connections |
| DenseNet | 2017 | dense connections |
| EfficientNet | 2019 | compound scaling |
| ConvNeXt | 2022 | modernized ResNet |
## Receptive Field
The region of the input that influences a given output unit. After $L$ layers with kernel size $k$ and stride 1:
$$\text{RF} = 1 + L(k-1)$$
Larger RF needed for global reasoning; achieved via deeper networks, dilated convolutions, or pooling.
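The stride-1 formula generalizes to stacks with mixed kernels, strides, and dilations; a small sketch (the `jump` bookkeeping, which tracks the cumulative stride product, is an assumption beyond the stride-1 formula above):

```python
def receptive_field(layers):
    """Receptive field after a stack of layers given as (kernel, stride, dilation)."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1     # footprint of a dilated kernel
        rf += (k_eff - 1) * jump    # each layer grows RF by (k_eff - 1) input steps
        jump *= s                   # one output step now spans this many input pixels
    return rf

# three 3x3 stride-1 convs: RF = 1 + 3*(3-1) = 7, matching the formula
print(receptive_field([(3, 1, 1)] * 3))  # 7
# the same three layers with dilation 2: larger RF at the same parameter cost
print(receptive_field([(3, 1, 2)] * 3))  # 13
```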