# Image Classification
## Task
Map image x ∈ ℝᴴˣᵂˣ³ to class label y ∈ {1,…,K}.
## Standard Pipeline
- Backbone: CNN or ViT extracts feature map
- Global pooling: spatial dims → feature vector
- Classifier head: Linear(d → K) + softmax
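The pipeline above can be sketched end-to-end in NumPy; the feature-map size (7×7×512) and class count are hypothetical placeholders, and the backbone output is faked with random values:

```python
import numpy as np

rng = np.random.default_rng(0)

def global_avg_pool(feat):
    # feat: (H, W, d) feature map -> (d,) vector by averaging spatial dims
    return feat.mean(axis=(0, 1))

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d, K = 512, 10               # hypothetical feature dim and class count
W = rng.normal(0, 0.01, (d, K))   # Linear(d -> K) weights
b = np.zeros(K)

feat = rng.normal(size=(7, 7, d))  # stand-in for the backbone's feature map
v = global_avg_pool(feat)          # (512,)
probs = softmax(v @ W + b)         # (K,) class probabilities
pred = int(probs.argmax())
```

In a real model the backbone and head are trained jointly; only the pool → linear → softmax structure is shown here.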
## ImageNet Benchmarks
| Model | Top-1 | Params | Year |
|---|---|---|---|
| AlexNet | 63.3% | 61M | 2012 |
| VGG-16 | 74.4% | 138M | 2014 |
| ResNet-50 | 76.1% | 25M | 2015 |
| EfficientNet-B7 | 84.3% | 66M | 2019 |
| ViT-L/16 | 85.2% | 307M | 2021 |
| ConvNeXt-XL | 87.0% | 350M | 2022 |
## Vision Transformer (ViT)
Divide the image into 16×16 patches → flatten each patch → linearly project to d_model → add position embeddings → feed to a Transformer encoder.
For 224×224 image with 16×16 patches: 196 patches + 1 [CLS] token = 197 tokens.
ViT lacks the convolutional inductive biases (locality, translation equivariance), so it requires large-scale pretraining (JFT-300M, ImageNet-21k) to match CNNs.
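The patch-embedding step can be verified with a small NumPy sketch; the projection width (256) is a hypothetical d_model, not the paper's value:

```python
import numpy as np

def patchify(img, patch=16):
    # img: (H, W, C) -> (num_patches, patch*patch*C) flattened patches
    H, W, C = img.shape
    p = img.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return p

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
tokens = patchify(img)                    # (196, 768): 14*14 patches, 16*16*3 dims

d_model = 256                             # hypothetical embedding width
W_proj = rng.normal(0, 0.02, (768, d_model))
cls = np.zeros((1, d_model))              # learnable [CLS] token in practice
seq = np.concatenate([cls, tokens @ W_proj])  # (197, d_model)
```

The sequence length 197 matches the 196-patches-plus-[CLS] count stated above; position embeddings would be added to `seq` before the encoder.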
## Transfer Learning
Pretrain on large dataset (ImageNet, JFT) → fine-tune on target task.
Fine-tuning strategies:
- Linear probe: freeze backbone, train only head (fast, lower accuracy)
- Full fine-tune: train all layers (higher accuracy, risk of forgetting)
- Layer-wise LR decay: lower LR for earlier layers
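Layer-wise LR decay is simple to compute: each layer's LR is the base LR scaled by a decay factor raised to its distance from the head. A minimal sketch (function name and decay value are illustrative):

```python
def layerwise_lrs(num_layers, base_lr, decay=0.75):
    # Layer 0 = earliest backbone layer; the last layer keeps base_lr.
    # Earlier layers get geometrically smaller learning rates.
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layerwise_lrs(4, 1e-3, decay=0.5)
# earliest layer gets 1e-3 * 0.5**3, the head keeps 1e-3
```

In practice these per-layer LRs would be passed as parameter groups to the optimizer.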
## Data Augmentation for Classification
Standard: random crop, horizontal flip, color jitter. Strong: RandAugment, AutoAugment, Mixup, CutMix.
Mixup: x̃ = λxᵢ + (1-λ)xⱼ, ỹ = λyᵢ + (1-λ)yⱼ
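The Mixup formula above translates directly to code: λ is drawn from Beta(α, α) and the same λ mixes both images and (one-hot) labels. A minimal NumPy sketch, with α = 0.2 as a commonly used default:

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    # Mix two samples: x~ = lam*x_i + (1-lam)*x_j, same for one-hot labels.
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x_i + (1 - lam) * x_j
    y = lam * y_i + (1 - lam) * y_j
    return x, y, lam
```

With α = 0.2 the Beta distribution concentrates mass near 0 and 1, so most mixed samples stay close to one of the originals.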