Image Classification#

Task#

Map image x ∈ ℝᴴˣᵂˣ³ to class label y ∈ {1,…,K}.

Standard Pipeline#

  1. Backbone: CNN or ViT extracts feature map
  2. Global pooling: spatial dims → feature vector
  3. Classifier head: Linear(d → K) + softmax
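The three pipeline stages can be sketched in NumPy. All shapes and weights here are hypothetical (a 7×7×512 feature map, K = 1000 classes, random head weights); a real backbone would produce the feature map.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Assume a backbone has already produced a feature map of shape (H', W', d)
H, W, d, K = 7, 7, 512, 1000                 # hypothetical sizes, 1000 classes
feature_map = rng.standard_normal((H, W, d))

# 2. Global average pooling: collapse spatial dims -> feature vector of shape (d,)
feature_vec = feature_map.mean(axis=(0, 1))

# 3. Classifier head: Linear(d -> K) + softmax
W_head = rng.standard_normal((d, K)) * 0.01  # hypothetical head weights
b_head = np.zeros(K)
logits = feature_vec @ W_head + b_head
probs = np.exp(logits - logits.max())        # numerically stable softmax
probs /= probs.sum()

pred = int(probs.argmax())                   # predicted class label y
```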

ImageNet Benchmarks#

| Model | Top-1 | Params | Year |
|---|---|---|---|
| AlexNet | 63.3% | 61M | 2012 |
| VGG-16 | 74.4% | 138M | 2014 |
| ResNet-50 | 76.1% | 25M | 2015 |
| EfficientNet-B7 | 84.3% | 66M | 2019 |
| ViT-L/16 | 85.2% | 307M | 2021 |
| ConvNeXt-XL | 87.0% | 350M | 2022 |

Vision Transformer (ViT)#

Divide image into 16×16 patches → flatten → linear project to d_model → add position embed → Transformer encoder.

For 224×224 image with 16×16 patches: 196 patches + 1 [CLS] token = 197 tokens.
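The patchify-and-project step above can be sketched directly, assuming ViT-B/16-style sizes (16×16 patches, d_model = 768); the projection weights are random placeholders standing in for a learned linear layer.

```python
import numpy as np

img_size, patch, d_model = 224, 16, 768        # ViT-B/16-style sizes
n_patches = (img_size // patch) ** 2           # (224/16)^2 = 196 patches

rng = np.random.default_rng(0)
image = rng.standard_normal((img_size, img_size, 3))

# Divide into 16x16 patches and flatten each to a (16*16*3)-dim vector
patches = image.reshape(img_size // patch, patch, img_size // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_patches, patch * patch * 3)

# Linear projection to d_model (placeholder weights), then prepend the [CLS] token
W_proj = rng.standard_normal((patch * patch * 3, d_model)) * 0.01
tokens = patches @ W_proj
cls = np.zeros((1, d_model))
tokens = np.concatenate([cls, tokens], axis=0)  # 196 patches + 1 [CLS] = 197 tokens
```

Position embeddings would then be added elementwise before the tokens enter the Transformer encoder.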

Lacking the convolutional inductive biases (locality, translation equivariance), ViT requires large-scale pretraining (JFT-300M, ImageNet-21K) to match CNNs.

Transfer Learning#

Pretrain on large dataset (ImageNet, JFT) → fine-tune on target task.

Fine-tuning strategies:

  • Linear probe: freeze backbone, train only head (fast, lower accuracy)
  • Full fine-tune: train all layers (higher accuracy, risk of forgetting)
  • Layer-wise LR decay: lower LR for earlier layers
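A linear probe amounts to multinomial logistic regression on frozen backbone features: only the head (W, b) receives gradients. A minimal sketch on toy data (random features and labels, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 64, 5
feats = rng.standard_normal((n, d))       # frozen backbone features (toy data)
labels = rng.integers(0, K, size=n)       # toy target-task labels

# Linear probe: train only the head by gradient descent; feats never change
W, b, lr = np.zeros((d, K)), np.zeros(K), 0.1
for _ in range(100):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = probs.copy()                   # softmax cross-entropy gradient
    grad[np.arange(n), labels] -= 1
    grad /= n
    W -= lr * feats.T @ grad              # only head parameters update
    b -= lr * grad.sum(axis=0)

train_acc = (probs.argmax(axis=1) == labels).mean()
```

Full fine-tuning would instead backpropagate through the backbone as well; layer-wise LR decay scales the learning rate down for layers closer to the input.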

Data Augmentation for Classification#

Standard: random crop, horizontal flip, color jitter. Strong: RandAugment, AutoAugment, Mixup, CutMix.

Mixup: x̃ = λxᵢ + (1-λ)xⱼ, ỹ = λyᵢ + (1-λ)yⱼ, with λ ~ Beta(α, α)
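The Mixup formula can be sketched as a small helper; labels are mixed as one-hot vectors (α = 0.2 is a common choice, but is a tunable hyperparameter):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Convex combination of two examples and their one-hot labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # λ ~ Beta(α, α)
    x_mix = lam * x_i + (1 - lam) * x_j
    y_mix = lam * y_i + (1 - lam) * y_j
    return x_mix, y_mix

rng = np.random.default_rng(0)
x_i, x_j = rng.standard_normal((2, 32, 32, 3))   # two toy images
y_i, y_j = np.eye(10)[3], np.eye(10)[7]          # one-hot labels, K = 10
x_mix, y_mix = mixup(x_i, y_i, x_j, y_j, rng=rng)
```

The mixed label ỹ is soft: its entries still sum to 1, so standard cross-entropy against it works unchanged.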