Image Classification#

Task#

Map image x ∈ ℝᴴˣᵂˣ³ to class label y ∈ {1,…,K}.

Standard Pipeline#

  1. Backbone: CNN or ViT extracts feature map
  2. Global pooling: spatial dims → feature vector
  3. Classifier head: Linear(d → K) + softmax
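The three pipeline stages can be sketched in NumPy. All shapes and weights here are hypothetical (a 7×7×512 feature map, K = 1000 classes, random head weights); a real backbone would produce the feature map.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Assume a backbone has already produced a feature map of shape (H', W', d)
H, W, d, K = 7, 7, 512, 1000                 # hypothetical sizes, 1000 classes
feature_map = rng.standard_normal((H, W, d))

# 2. Global average pooling: collapse spatial dims -> feature vector of shape (d,)
feature_vec = feature_map.mean(axis=(0, 1))

# 3. Classifier head: Linear(d -> K) + softmax
W_head = rng.standard_normal((d, K)) * 0.01  # hypothetical head weights
b_head = np.zeros(K)
logits = feature_vec @ W_head + b_head
probs = np.exp(logits - logits.max())        # numerically stable softmax
probs /= probs.sum()

pred = int(probs.argmax())                   # predicted class label y
```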

ImageNet Benchmarks#

| Model | Top-1 | Params | Year |
|---|---|---|---|
| AlexNet | 63.3% | 61M | 2012 |
| VGG-16 | 74.4% | 138M | 2014 |
| ResNet-50 | 76.1% | 25M | 2015 |
| EfficientNet-B7 | 84.3% | 66M | 2019 |
| ViT-L/16 | 85.2% | 307M | 2021 |
| ConvNeXt-XL | 87.0% | 350M | 2022 |

Vision Transformer (ViT)#

Divide image into 16×16 patches → flatten → linear project to d_model → add position embed → Transformer encoder.

For 224×224 image with 16×16 patches: 196 patches + 1 [CLS] token = 197 tokens.
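The patchify-and-project step above can be sketched directly, assuming ViT-B/16-style sizes (16×16 patches, d_model = 768); the projection weights are random placeholders standing in for a learned linear layer.

```python
import numpy as np

img_size, patch, d_model = 224, 16, 768        # ViT-B/16-style sizes
n_patches = (img_size // patch) ** 2           # (224/16)^2 = 196 patches

rng = np.random.default_rng(0)
image = rng.standard_normal((img_size, img_size, 3))

# Divide into 16x16 patches and flatten each to a (16*16*3)-dim vector
patches = image.reshape(img_size // patch, patch, img_size // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_patches, patch * patch * 3)

# Linear projection to d_model (placeholder weights), then prepend the [CLS] token
W_proj = rng.standard_normal((patch * patch * 3, d_model)) * 0.01
tokens = patches @ W_proj
cls = np.zeros((1, d_model))
tokens = np.concatenate([cls, tokens], axis=0)  # 196 patches + 1 [CLS] = 197 tokens
```

Position embeddings would then be added elementwise before the tokens enter the Transformer encoder.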

Lacking the convolutional inductive biases (locality, translation equivariance), ViT requires large-scale pretraining (JFT-300M, ImageNet-21K) to match CNNs.

Transfer Learning#

Pretrain on large dataset (ImageNet, JFT) → fine-tune on target task.

Fine-tuning strategies:

  • Linear probe: freeze backbone, train only head (fast, lower accuracy)
  • Full fine-tune: train all layers (higher accuracy, risk of forgetting)
  • Layer-wise LR decay: lower LR for earlier layers
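A linear probe amounts to multinomial logistic regression on frozen backbone features: only the head (W, b) receives gradients. A minimal sketch on toy data (random features and labels, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 64, 5
feats = rng.standard_normal((n, d))       # frozen backbone features (toy data)
labels = rng.integers(0, K, size=n)       # toy target-task labels

# Linear probe: train only the head by gradient descent; feats never change
W, b, lr = np.zeros((d, K)), np.zeros(K), 0.1
for _ in range(100):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = probs.copy()                   # softmax cross-entropy gradient
    grad[np.arange(n), labels] -= 1
    grad /= n
    W -= lr * feats.T @ grad              # only head parameters update
    b -= lr * grad.sum(axis=0)

train_acc = (probs.argmax(axis=1) == labels).mean()
```

Full fine-tuning would instead backpropagate through the backbone as well; layer-wise LR decay scales the learning rate down for layers closer to the input.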

Data Augmentation for Classification#

Standard: random crop, horizontal flip, color jitter. Strong: RandAugment, AutoAugment, Mixup, CutMix.

Mixup: x̃ = λxᵢ + (1-λ)xⱼ, ỹ = λyᵢ + (1-λ)yⱼ, with λ ~ Beta(α, α)
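The Mixup formula can be sketched as a small helper; labels are mixed as one-hot vectors (α = 0.2 is a common choice, but is a tunable hyperparameter):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Convex combination of two examples and their one-hot labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # λ ~ Beta(α, α)
    x_mix = lam * x_i + (1 - lam) * x_j
    y_mix = lam * y_i + (1 - lam) * y_j
    return x_mix, y_mix

rng = np.random.default_rng(0)
x_i, x_j = rng.standard_normal((2, 32, 32, 3))   # two toy images
y_i, y_j = np.eye(10)[3], np.eye(10)[7]          # one-hot labels, K = 10
x_mix, y_mix = mixup(x_i, y_i, x_j, y_j, rng=rng)
```

The mixed label ỹ is soft: its entries still sum to 1, so standard cross-entropy against it works unchanged.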