Object Detection#

Task#

Locate and classify multiple objects in an image. Output: bounding boxes + class labels + confidence scores.

Bounding Box Parameterization#

Box = (x_center, y_center, width, height) normalized by image dims.
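
The normalized center form above can be converted to absolute pixel corners for IoU/NMS. A minimal sketch (function name is illustrative):

```python
def center_to_corners(cx, cy, w, h, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2) corners."""
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return x1, y1, x2, y2
```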

IoU (Intersection over Union) = |A∩B| / |A∪B| — standard overlap metric.
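
In corner coordinates the IoU formula is a few lines; a minimal sketch:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```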

Two-Stage Detectors#

  1. Region Proposal Network (RPN): proposes candidate regions
  2. ROI pooling/align: extract fixed-size features for each proposal
  3. Classification + regression head: refine box and predict class

Faster R-CNN: anchor-based RPN + ROI Align. High accuracy, slower.
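
The RPN scores a dense set of anchors tiled over the feature map. A sketch of anchor generation (scales and ratios follow the Faster R-CNN paper's defaults; the ratio convention here is width/height, which varies between implementations):

```python
def make_anchors(feat_h, feat_w, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate anchors (x1, y1, x2, y2) centered on each feature-map cell.

    Each cell gets len(scales) * len(ratios) anchors; all anchors for a given
    scale s have area s**2, with aspect ratio r = width / height.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Map the cell back to image coordinates via the feature stride
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * r ** 0.5
                    h = s / r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```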

One-Stage Detectors#

Predict boxes directly from dense grid of anchors.

YOLO series: real-time, single-pass. Each grid cell predicts B boxes + C class scores. Recent versions (YOLOv8/v9) are anchor-free; YOLOv10 adds NMS-free end-to-end prediction. Strong speed-accuracy tradeoff.
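
Classic anchor-based YOLO (v2/v3) decodes each cell's raw outputs against an anchor prior; newer versions are anchor-free but still decode per-cell offsets in a similar spirit. A sketch of the v2-style decoding:

```python
import math

def decode_cell(tx, ty, tw, th, i, j, grid, anchor_w, anchor_h):
    """Decode one YOLOv2-style cell prediction into a normalized (cx, cy, w, h) box.

    (tx, ty) are sigmoid-squashed offsets inside cell (i, j) of a grid x grid map;
    (tw, th) scale the anchor prior exponentially.
    """
    sig = lambda z: 1 / (1 + math.exp(-z))
    cx = (j + sig(tx)) / grid
    cy = (i + sig(ty)) / grid
    w = anchor_w * math.exp(tw)
    h = anchor_h * math.exp(th)
    return cx, cy, w, h
```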

SSD: multi-scale feature maps, anchor boxes at each scale.

RetinaNet: FPN backbone + focal loss (handles class imbalance in dense detection).
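
Focal loss down-weights easy examples so the huge number of easy negatives in dense detection doesn't dominate training. A scalar sketch (alpha=0.25, gamma=2 are the RetinaNet paper's defaults):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction p = P(positive), label y in {0, 1}.

    With gamma=0 and alpha=1 this reduces to standard cross-entropy; raising
    gamma shrinks the loss of well-classified examples (pt close to 1).
    """
    pt = p if y == 1 else 1 - p
    a = alpha if y == 1 else 1 - alpha
    return -a * (1 - pt) ** gamma * math.log(pt)
```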

Anchor-Free Detectors#

Predict box center + offsets without predefined anchors:

  • FCOS: predict (l,r,t,b) distances from point to box edges
  • CenterNet: detect objects as keypoints (heatmap at center)
  • DETR: Transformer-based, set prediction with bipartite matching loss
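
The FCOS parameterization above decodes trivially: the four distances from a point recover the box corners. A one-liner sketch:

```python
def fcos_decode(px, py, l, t, r, b):
    """Recover (x1, y1, x2, y2) from FCOS-style distances (l, t, r, b)
    measured from point (px, py) to the left/top/right/bottom box edges."""
    return px - l, py - t, px + r, py + b
```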

Feature Pyramid Network (FPN)#

Multi-scale feature extraction: bottom-up (backbone) + top-down (with lateral connections).

Small objects are detected on high-resolution levels and large objects on coarse levels; the top-down pathway propagates strong semantics back into the high-resolution maps, so every level gets both.
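
A NumPy sketch of the top-down merge, assuming the 1×1 lateral convs have already been applied so all levels share the same channel count; nearest-neighbor upsampling stands in for the real interpolation, and the 3×3 smoothing conv FPN applies after each merge is omitted:

```python
import numpy as np

def fpn_top_down(features):
    """Merge feature maps FPN-style.

    features: list of (C, H, W) arrays ordered fine -> coarse.
    Each output level = its lateral map + 2x-upsampled next-coarser merged map.
    """
    merged = [features[-1]]  # coarsest level passes through unchanged
    for lat in reversed(features[:-1]):
        # 2x nearest-neighbor upsample of the coarser merged map
        up = merged[0].repeat(2, axis=1).repeat(2, axis=2)
        merged.insert(0, lat + up[:, :lat.shape[1], :lat.shape[2]])
    return merged
```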

Non-Maximum Suppression (NMS)#

Post-processing: remove duplicate boxes.

  1. Sort boxes by confidence
  2. Select highest-confidence box
  3. Remove all remaining boxes whose IoU with the selected box exceeds a threshold (commonly 0.5)
  4. Repeat
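
The steps above can be sketched as:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: list of (x1, y1, x2, y2); returns indices of kept boxes."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-confidence remaining box
        keep.append(best)
        # Drop everything that overlaps it too much
        order = [k for k in order if iou(boxes[best], boxes[k]) <= iou_thresh]
    return keep
```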

Soft-NMS: decay scores of overlapping boxes instead of removing — better for crowded scenes.

COCO Metrics#

  • [email protected]: mean AP at IoU threshold 0.5
  • [email protected]:0.95: mean AP averaged over IoU thresholds 0.50–0.95 in steps of 0.05 (the primary COCO metric)
  • AP_S / AP_M / AP_L: AP for small (area < 32²), medium (32²–96²), large (> 96²) objects
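
The [email protected]:0.95 averaging is just a mean over ten fixed thresholds; a sketch (`ap_at` is a hypothetical callable standing in for a full AP computation at one threshold):

```python
def coco_map(ap_at):
    """Average AP over the 10 COCO IoU thresholds 0.50, 0.55, ..., 0.95.

    ap_at: callable mapping an IoU threshold to the AP at that threshold.
    """
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(ap_at(t) for t in thresholds) / len(thresholds)
```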