# Object Detection
## Task
Locate and classify multiple objects in an image. Output: bounding boxes + class labels + confidence scores.
## Bounding Box Parameterization
Box = (x_center, y_center, width, height) normalized by image dimensions; the corner format (x1, y1, x2, y2) is also common.
IoU (Intersection over Union) = |A∩B| / |A∪B| — standard overlap metric.
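Both definitions above fit in a few lines; a minimal sketch (function names are mine, boxes in corner format):

```python
def cxcywh_to_xyxy(box):
    """Convert (x_center, y_center, width, height) to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def iou(box_a, box_b):
    """IoU of two boxes in corner format: intersection area / union area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give IoU 1.0; disjoint boxes give 0.0.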
## Two-Stage Detectors
- Region Proposal Network (RPN): proposes candidate regions
- ROI pooling/align: extract fixed-size features for each proposal
- Classification + regression head: refine box and predict class
Faster R-CNN: anchor-based RPN + ROI Align. High accuracy, slower.
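The regression head refines each box relative to its anchor; a minimal sketch of the standard Faster R-CNN delta parameterization (function name is mine):

```python
import math

def decode_box(anchor, deltas):
    """Apply Faster R-CNN-style regression deltas (tx, ty, tw, th) to an
    anchor given as (cx, cy, w, h): the center moves proportionally to the
    anchor size, width/height are scaled exponentially."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = deltas
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))
```

Zero deltas return the anchor unchanged; the exponential keeps predicted width and height positive.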
## One-Stage Detectors
Predict boxes directly from a dense grid of anchors in a single forward pass.
YOLO series: real-time, single-pass. Early versions: each grid cell predicts B boxes + C class scores. YOLOv8/v9/v10: anchor-free, end-to-end, strong speed-accuracy tradeoff.
SSD: multi-scale feature maps, anchor boxes at each scale.
RetinaNet: FPN backbone + focal loss (handles class imbalance in dense detection).
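Focal loss down-weights well-classified examples so the flood of easy background anchors does not dominate training; a minimal single-prediction binary sketch:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p: predicted foreground probability, y: label in {0, 1}.
    gamma=0 recovers plain (alpha-weighted) cross-entropy."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

A confident correct prediction (p_t near 1) contributes almost nothing, while a hard misclassified one keeps nearly its full cross-entropy weight.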
## Anchor-Free Detectors
Predict box center + offsets without predefined anchors:
- FCOS: predict (l, t, r, b) distances from a point to the box edges
- CenterNet: detect objects as keypoints (heatmap at center)
- DETR: Transformer-based, set prediction with bipartite matching loss
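FCOS-style decoding is just arithmetic on the four predicted distances; a minimal sketch (helper name is mine):

```python
def fcos_decode(px, py, dists):
    """Recover a corner-format box (x1, y1, x2, y2) from a feature-map
    point (px, py) and its predicted (l, t, r, b) distances to the
    left, top, right, and bottom box edges."""
    l, t, r, b = dists
    return (px - l, py - t, px + r, py + b)
```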
## Feature Pyramid Network (FPN)
Multi-scale feature extraction: bottom-up (backbone) + top-down (with lateral connections).
Detects small objects (via high-resolution, shallow features) and large objects (via semantically rich, deep features) simultaneously.
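One top-down merge step can be sketched on plain nested lists, assuming the lateral map has already gone through its 1x1 projection (helper names are mine):

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def fpn_merge(top, lateral):
    """One FPN top-down step: upsample the coarser map and add the
    (already projected) lateral map elementwise."""
    up = upsample2x(top)
    return [[u + l for u, l in zip(ur, lr)] for ur, lr in zip(up, lateral)]
```

In a real FPN the maps are multi-channel tensors and a 3x3 conv smooths each merged level; the elementwise structure is the same.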
## Non-Maximum Suppression (NMS)
Post-processing step that removes duplicate detections of the same object:
- Sort boxes by confidence score
- Select the highest-confidence box and keep it
- Remove all remaining boxes whose IoU with the kept box exceeds a threshold (commonly 0.5)
- Repeat until no boxes remain
Soft-NMS: decay scores of overlapping boxes instead of removing — better for crowded scenes.
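The greedy procedure above, as a self-contained sketch (corner-format boxes):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: list of (x1, y1, x2, y2); scores: confidences.
    Returns indices of kept boxes, highest-confidence first."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        order = [i for i in order    # drop heavy overlaps with the kept box
                 if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

Soft-NMS would replace the filtering line with a score decay (e.g. multiply by a Gaussian of the IoU) and re-sort instead of discarding.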
## COCO Metrics
| Metric | Description |
|---|---|
| [email protected] | mean AP at IoU threshold 0.5 |
| [email protected]:0.95 | mean AP averaged over IoU thresholds 0.5 to 0.95 in steps of 0.05 (primary COCO metric) |
| AP_S/M/L | AP for small/medium/large objects |
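Each AP number is the area under a precision-recall curve built by sweeping down the confidence-ranked detections; a minimal uninterpolated sketch (COCO itself uses 101-point recall interpolation, and the function name is mine):

```python
def average_precision(tp_flags, num_gt):
    """AP as area under the precision-recall curve. tp_flags[i] is True if
    the i-th detection (sorted by descending confidence) matches an unused
    ground-truth box at the chosen IoU threshold; num_gt is the number of
    ground-truth boxes. Uninterpolated rectangle rule over recall."""
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for flag in tp_flags:
        if flag:
            tp += 1
        else:
            fp += 1
        recall = tp / num_gt
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)  # area of this recall step
        prev_recall = recall
    return ap
```

A perfect ranking (all TPs first, all ground truth found) yields AP = 1.0; false positives ranked above true positives pull it down.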