Faster R-CNN (ResNet-101 backbone) with 1024×1024 input. A two-stage detector consisting of:
- Backbone (ResNet-101): Deep residual network that extracts high-level semantic features.
- RPN (Region Proposal Network): Generates candidate object regions by scoring and refining anchors.
- RoI Align: Pools the features of each proposed region to a fixed resolution using bilinear sampling, preserving spatial alignment.
- Detection Head: Classifies regions and refines bounding boxes.
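
As a rough illustration of how the RPN's anchors are laid out, the sketch below tiles anchors over a feature map. The stride of 16, the scales (128, 256, 512) and the aspect ratios (0.5, 1, 2) are assumptions taken from the common Faster R-CNN defaults, not values confirmed for this model, and `make_anchors` is a hypothetical helper name.

```python
import math


def make_anchors(feat_h, feat_w, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes in input-image pixels.

    One anchor is created per (cell, scale, ratio) combination.
    The scales and ratios are assumed defaults, not this model's config.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell in input-image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio = height / width; the area stays equal to scale**2.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# A 1024x1024 input at an assumed stride of 16 gives a 64x64 feature map.
anchors = make_anchors(feat_h=64, feat_w=64, stride=16)
print(len(anchors))
```

With these assumed settings every cell carries nine anchors, which the RPN then scores and refines before RoI Align and the detection head see them.
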
COCO 2017 Validation
| Metric | Value | Metric | Value |
|---|---|---|---|
| mAP @[0.50:0.95] | 0.317 | AP @ 0.50 | 0.456 |
| Avg IoU (Matched) | 0.830 | AP (Small/Med/Lrg) | 0.127 / 0.347 / 0.498 |
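
For readers unfamiliar with the notation, mAP @[0.50:0.95] is the AP averaged over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only that averaging step; the per-threshold values are hypothetical numbers chosen for illustration, not measurements from this evaluation, and `coco_map` is an illustrative helper name.

```python
def coco_map(ap_by_threshold):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95.

    Keys are thresholds expressed as integer percentages (50, 55, ..., 95)
    to avoid floating-point key comparisons.
    """
    thresholds = list(range(50, 100, 5))
    missing = [t for t in thresholds if t not in ap_by_threshold]
    if missing:
        raise ValueError(f"missing AP for thresholds: {missing}")
    return sum(ap_by_threshold[t] for t in thresholds) / len(thresholds)


# Hypothetical per-threshold APs, for illustration only.
example_ap = {50: 0.60, 55: 0.56, 60: 0.52, 65: 0.47, 70: 0.41,
              75: 0.34, 80: 0.26, 85: 0.17, 90: 0.08, 95: 0.01}
print(round(coco_map(example_ap), 3))
```

Because AP can only fall as the IoU threshold tightens, the averaged figure is never higher than AP @ 0.50, which is why the two columns in the table differ.
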

Pascal VOC 2007 (COCO metrics)
| Metric | Value | Metric | Value |
|---|---|---|---|
| mAP @[0.50:0.95] | 0.524 | AP @ 0.50 | 0.744 |
| Avg IoU (Matched) | 0.841 | AP (Small/Med/Lrg) | 0.168 / 0.407 / 0.619 |
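- Performance: mAP @[0.50:0.95] is 0.317 on COCO and 0.524 on VOC, an absolute gap of about 0.21.
- Localization: Avg IoU on matched detections is high (0.830 on COCO, 0.841 on VOC), which is consistent with the second-stage box refinement producing tight boxes.
- Scale: AP is much higher on medium and large objects (0.347 / 0.498 on COCO, 0.407 / 0.619 on VOC) than on small objects (0.127 on COCO, 0.168 on VOC).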
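
The Avg IoU (Matched) figure can be read as the mean IoU over predictions that were paired with a ground-truth box. The exact matching rule used for these tables is not stated, so the sketch below assumes greedy one-to-one matching at an IoU threshold of 0.5, with predictions already sorted by descending confidence; `iou` and `avg_matched_iou` are illustrative names.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def avg_matched_iou(preds, gts, threshold=0.5):
    """Mean IoU over predictions matched to a ground-truth box.

    Assumes greedy one-to-one matching; the 0.5 threshold is an assumption.
    """
    unmatched = list(range(len(gts)))
    matched_ious = []
    for pred in preds:
        best_iou, best_idx = 0.0, None
        for idx in unmatched:
            value = iou(pred, gts[idx])
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou >= threshold:
            matched_ious.append(best_iou)
            unmatched.remove(best_idx)  # each ground truth is used once
    return sum(matched_ious) / len(matched_ious) if matched_ious else 0.0


# Toy boxes for illustration only; the third prediction matches nothing.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 8), (21, 20, 30, 30), (50, 50, 60, 60)]
print(avg_matched_iou(preds, gts))
```

Unmatched predictions and missed ground truths do not enter this average, so a high value says the boxes that were found are tight, not that few objects were missed.
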
Successful Detections:
Failure Cases:
Feature Maps:
SSDLite320 (MobileNetV3-Large backbone). A single-stage, efficient detector for edge devices.
- Backbone: Lightweight MobileNetV3-Large with squeeze-and-excitation (SE) modules.
- SSD: Direct bounding box prediction without region proposals.
- Resolution: Fixed 320×320 input for speed, limiting small object visibility.
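
To make "direct bounding box prediction" concrete, the sketch below decodes one predicted offset vector against a default (prior) box, following the encoding from the original SSD formulation. The variances (0.1, 0.2) are the commonly used values and are an assumption here, as is the hypothetical helper name `decode_box`.

```python
import math


def decode_box(offsets, default_box, variances=(0.1, 0.2)):
    """Convert predicted offsets into a corner-format box.

    offsets:     (tx, ty, tw, th) predicted by the network
    default_box: (cx, cy, w, h) of the prior box, in pixels
    variances:   assumed scaling factors (0.1 for centre, 0.2 for size)
    """
    tx, ty, tw, th = offsets
    dcx, dcy, dw, dh = default_box
    # Centre shifts are expressed relative to the default box size.
    cx = dcx + tx * variances[0] * dw
    cy = dcy + ty * variances[0] * dh
    # Width and height are predicted in log space.
    w = dw * math.exp(tw * variances[1])
    h = dh * math.exp(th * variances[1])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Zero offsets return the default box itself, here a 64x64 box centred at (160, 160).
print(decode_box((0.0, 0.0, 0.0, 0.0), (160.0, 160.0, 64.0, 64.0)))
```

Every default box is decoded this way in a single forward pass, which is what removes the separate proposal stage.
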
COCO 2017 Validation
| Metric | Value |
|---|---|
| mAP @[0.50:0.95] | 0.2107 |
| AP @ 0.50 | 0.3388 |
| AP @ 0.75 | 0.2190 |
| AP (Small) | 0.0034 |
| AP (Medium) | 0.0947 |
| AP (Large) | 0.4119 |
| Mean IoU (>0.7) | 0.8062 |

Pascal VOC 2007 (COCO metrics)
| Metric | Value |
|---|---|
| mAP @[0.50:0.95] | 0.4150 |
| AP @ 0.50 | 0.6364 |
| AP @ 0.75 | 0.4443 |
| AP (Small) | 0.0056 |
| AP (Medium) | 0.1784 |
| AP (Large) | 0.5668 |
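- Trade-off: mAP @[0.50:0.95] of 0.4150 on VOC and 0.2107 on COCO reflects the accuracy given up for efficiency.
- Small Object Blindness: AP (Small) is extremely low (0.0056 on VOC, 0.0034 on COCO), which is consistent with the 320×320 input shrinking small objects even further.
- Strengths: Competent on large, prominent objects (AP Large of 0.5668 on VOC, 0.4119 on COCO).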
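
The effect of the fixed input size on small objects can be estimated with simple arithmetic. The sketch below uses the standard COCO size buckets (small: area below 32², medium: below 96², large: otherwise) and assumes the image is resized straight to 320×320 without preserving aspect ratio; the 640×480 image and 30×30 object are hypothetical, and both helper names are illustrative.

```python
def coco_size_bucket(width, height):
    """Classify a box by area using the COCO small/medium/large convention."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def resized_box_size(box_w, box_h, image_w, image_h, target=320):
    """Box size in pixels after resizing the image to target x target.

    Assumes a direct resize that does not preserve aspect ratio.
    """
    return box_w * target / image_w, box_h * target / image_h


# Hypothetical example: a 30x30 object in a 640x480 image.
print(coco_size_bucket(30, 30))
print(resized_box_size(30, 30, 640, 480))
```

COCO assigns buckets from the original annotation area, so an object that is already "small" before resizing reaches the network with even fewer pixels.
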
Successful Detections:
Failure Cases:
Feature Maps:
YOLOv12: High-speed detector using attention mechanisms (FlashAttention, Area Attention) and a hierarchical backbone (R-ELAN).
Efficiency: 59.1M Params, 199.0 GFLOPs, 11.79 ms/img latency.
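
The efficiency figures can be turned into more familiar quantities with a few lines of arithmetic. The sketch below assumes images are processed one at a time (batch size 1) and that weights are stored as 32-bit floats; neither is stated in the source, and the helper names are illustrative.

```python
def throughput_fps(latency_ms):
    """Images per second, assuming sequential single-image inference."""
    return 1000.0 / latency_ms


def achieved_gflops_per_second(gflops_per_image, latency_ms):
    """Compute rate implied by the per-image cost and the latency."""
    return gflops_per_image * throughput_fps(latency_ms)


def weight_memory_mb(params_millions, bytes_per_param=4):
    """Approximate weight storage in MB (10**6 bytes), assuming fp32 weights."""
    return params_millions * bytes_per_param


fps = throughput_fps(11.79)
rate = achieved_gflops_per_second(199.0, 11.79)
memory = weight_memory_mb(59.1)
print(f"{fps:.1f} img/s, {rate / 1000:.1f} TFLOP/s, {memory:.1f} MB")
```

Under these assumptions the reported latency corresponds to roughly 85 images per second.
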
Detection Performance
| Dataset | mAP @[0.50:0.95] | Avg IoU | mAP @ 0.50 |
|---|---|---|---|
| COCO 2017 | 0.718 | 0.554 | 0.841 |
| Pascal VOC 2007 | 0.898 | 0.714 | 0.888 |
Successful Cases (COCO & Pascal):
Failure Cases (COCO & Pascal):
Feature Map Visualization:





