Object Detection in 20 Years: A Survey

lynn_0909

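This article reviews the key techniques and development history of object detection, introducing the classic models from traditional detectors to the deep learning era, including Viola Jones, HOG, DPM, the RCNN series, and the YOLO series. It also covers multi-scale detection techniques and commonly used datasets.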

From the application point of view, object detection can be grouped into two research topics: “general object detection” and “detection applications”. The former aims to explore methods for detecting different types of objects under a unified framework, simulating human vision and cognition. The latter refers to detection under specific application scenarios, such as pedestrian detection, face detection, and text detection.

Difficulties and Challenges in Object Detection
Object rotation and scale changes (e.g., small objects), accurate object localization, dense and occluded object detection, detection speed-up, etc.

1、Milestones: Traditional Detectors

Viola Jones Detectors

2001. Running on a 700MHz Pentium III CPU, the VJ detector uses sliding windows: it goes through all possible locations and scales in an image to see whether any window contains a human face. It achieves its speed by incorporating three important techniques: “integral image”, “feature selection”, and “detection cascades”.
1) Integral image: The integral image is a computational method to speed up box filtering or convolution. Like other object detection algorithms of its time [29–31], the VJ detector uses the Haar wavelet as the feature representation of an image. The integral image makes the computational complexity of each window in the VJ detector independent of its window size.
2) Feature selection: Instead of using a set of manually selected Haar basis filters, the authors used the Adaboost algorithm [32] to select a small set of features that are most helpful for face detection from a huge random feature pool (about 180k-dimensional).
3) Detection cascades: A multi-stage detection paradigm (a.k.a. the “detection cascades”) was introduced in the VJ detector to reduce its computational overhead by spending less computation on background windows and more on face targets.
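
The integral image and the cascade described above can be sketched in a few lines. This is a minimal illustration, not the original VJ implementation: the function names, the toy 4x4 image, and the stage thresholds are all assumptions made for the example. It shows that any box sum costs four table lookups regardless of box size, and that a cascade stops at the first failed stage.

```python
def integral_image(img):
    """Return a (H+1) x (W+1) summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using four lookups, whatever its size."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_haar(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half (even width)."""
    half = width // 2
    return box_sum(ii, top, left, height, half) - box_sum(ii, top, left + half, height, half)


def cascade_detect(window, stages):
    """Run (score_fn, threshold) stages in order and reject at the first failure."""
    for n_run, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, n_run  # background windows usually exit here, early
    return True, len(stages)


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
table = integral_image(image)

# Hypothetical two-stage cascade: a cheap test first, a Haar feature second.
stages = [
    (lambda ii: box_sum(ii, 0, 0, 4, 4), 100),
    (lambda ii: -two_rect_haar(ii, 0, 0, 4, 4), 10),
]
print(box_sum(table, 1, 1, 2, 2))     # 6 + 7 + 10 + 11
print(cascade_detect(table, stages))  # (accepted, number of stages evaluated)
```

A real detector would learn the stages with Adaboost; here they are hard-coded only to show the early-exit control flow.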

HOG Detector

The Histogram of Oriented Gradients (HOG) feature descriptor was originally proposed in 2005 by N. Dalal and B. Triggs [12].
HOG can be considered an important improvement on the scale-invariant feature transform [33, 34] and shape contexts [35] of its time.

To balance feature invariance (including translation, scale, illumination, etc.) and nonlinearity (for discriminating between object categories), the HOG descriptor is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization (on “blocks”) to improve accuracy. Although HOG can be used to detect a variety of object classes, it was motivated primarily by the problem of pedestrian detection. To detect objects of different sizes, the HOG detector rescales the input image multiple times while keeping the size of the detection window unchanged.
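
The multi-scale strategy in the last sentence (rescale the image, keep the window fixed) can be sketched as follows. This is only an illustration under assumptions: the 64x128 window matches the pedestrian window commonly associated with HOG, while the scale step, the stride, and all function names are hypothetical choices for the example. No HOG features are computed here; the sketch only enumerates pyramid levels and maps a hit back to the original image.

```python
def pyramid_levels(img_w, img_h, win_w=64, win_h=128, scale_step=1.2):
    """List (scale, width, height) for each level in which the window still fits."""
    levels = []
    scale = 1.0
    while True:
        w, h = int(img_w / scale), int(img_h / scale)
        if w < win_w or h < win_h:
            break
        levels.append((scale, w, h))
        scale *= scale_step
    return levels


def windows_at_level(w, h, win_w=64, win_h=128, stride=8):
    """Number of sliding-window positions in one rescaled image."""
    return ((w - win_w) // stride + 1) * ((h - win_h) // stride + 1)


def to_original(x, y, scale, win_w=64, win_h=128):
    """Map a window found at (x, y) in a rescaled image back to the input image."""
    return (x * scale, y * scale, (x + win_w) * scale, (y + win_h) * scale)


for scale, w, h in pyramid_levels(320, 240):
    print(round(scale, 3), (w, h), windows_at_level(w, h))
```

Because the window size never changes, one trained classifier serves every level; larger objects are simply found in the more heavily downscaled images.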

Deformable Part-based Model (DPM)

The DPM follows the detection philosophy of “divide and conquer”: training can be seen as learning a proper way of decomposing an object, and inference can be seen as an ensemble of detections on different object parts. For example, the problem of detecting a “car” can be treated as the detection of its window, body, and wheels.
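
One common way to read the “ensemble of detections on parts” idea is a root score plus, for each part, the best part response after subtracting a penalty for drifting away from its expected position. The sketch below assumes that reading; the quadratic penalty, its weight, the dictionary layout, and the part names are illustrative assumptions, not details given in the text above.

```python
def dpm_score(root_score, part_responses, anchors, deform_weight=0.1):
    """Root score plus, per part, the best (response - displacement penalty)."""
    total = root_score
    for name, response in part_responses.items():
        ax, ay = anchors[name]  # expected part position relative to the root
        best = max(
            score - deform_weight * ((x - ax) ** 2 + (y - ay) ** 2)
            for (x, y), score in response.items()
        )
        total += best
    return total


# Hypothetical part responses for a "car" hypothesis: {position: filter score}.
part_responses = {
    "wheel": {(0, 0): 1.0, (2, 0): 2.0},
    "window": {(1, 1): 0.5},
}
anchors = {"wheel": (0, 0), "window": (1, 1)}
print(dpm_score(0.5, part_responses, anchors))
```

The wheel scores higher two pixels away from its anchor, and the penalty decides whether that displaced placement is still worth taking.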

2、Milestones: CNN based Two-stage Detectors

RCNN

idea: It starts with the extraction of a set of object proposals (object candidate boxes) by selective search [42]. Each proposal is then rescaled to a fixed-size image and fed into a CNN model trained on ImageNet (say, AlexNet [40]) to extract features. Finally, linear SVM classifiers are used to predict the presence of an object within each region and to recognize object categories.

drawbacks: The redundant feature computations on a large number of overlapping proposals (over 2000 boxes from one image) lead to an extremely slow detection speed (14s per image with GPU). Later in the same year, SPPNet [17] was proposed and overcame this problem.
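
The three steps (proposals, warp to a fixed size, per-region features scored by linear classifiers) can be laid out as a toy pipeline. Everything below is a stand-in: the proposals are hard-coded rather than produced by selective search, `toy_features` just flattens pixels instead of running a CNN, and the classifier weights are invented for the example. The point is only the control flow, including the one feature extraction per proposal that makes R-CNN slow.

```python
def warp(image, box, size):
    """Crop box=(x0, y0, x1, y1) and resize it to size x size (nearest neighbour)."""
    x0, y0, x1, y1 = box
    bw, bh = x1 - x0, y1 - y0
    return [
        [image[y0 + (r * bh) // size][x0 + (c * bw) // size] for c in range(size)]
        for r in range(size)
    ]


def toy_features(patch):
    """Hypothetical stand-in for CNN features: the flattened pixels."""
    return [v for row in patch for v in row]


def rcnn_detect(image, proposals, classifiers, size=2):
    """Score every proposal with every linear classifier; keep positive scores."""
    detections = []
    for box in proposals:
        feats = toy_features(warp(image, box, size))  # recomputed for each proposal
        for label, (weights, bias) in classifiers.items():
            score = sum(w * f for w, f in zip(weights, feats)) + bias
            if score > 0:
                detections.append((box, label, score))
    return detections


image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]
proposals = [(0, 0, 2, 2), (2, 2, 4, 4)]            # stand-in for selective search
classifiers = {"bright": ([1, 1, 1, 1], -10)}      # stand-in for a trained linear SVM
detections = rcnn_detect(image, proposals, classifiers)
print(detections)
```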

SPPNet （Spatial Pyramid Pooling Networks）

idea: The main contribution of SPPNet is the introduction of a Spatial Pyramid Pooling (SPP) layer, which enables a CNN to generate a fixed-length representation regardless of the size of the image or region of interest, without rescaling it. When using SPPNet for object detection, the feature maps are computed from the entire image only once, and fixed-length representations of arbitrary regions can then be generated for training the detectors, which avoids repeatedly computing the convolutional features. SPPNet is more than 20 times faster than R-CNN without sacrificing any detection accuracy (VOC07 mAP=59.2%).
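
A small sketch may make the “fixed-length representation regardless of size” property concrete. It pools a single-channel map into 1x1, 2x2, and 4x4 grids of bins and concatenates the maxima; the pyramid levels and the function names are assumptions for illustration, and a real SPP layer would do this per channel.

```python
def bin_edges(length, n, i):
    """Start (floor) and end (ceiling) of bin i when a side is split into n bins."""
    start = (i * length) // n
    end = ((i + 1) * length + n - 1) // n
    return start, end


def spp(feature_map, levels=(1, 2, 4)):
    """Max-pool an H x W map into n x n bins for each level and concatenate."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            y0, y1 = bin_edges(h, n, i)
            for j in range(n):
                x0, x1 = bin_edges(w, n, j)
                pooled.append(
                    max(feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1))
                )
    return pooled


small = [[x + 5 * y for x in range(5)] for y in range(3)]   # 3 x 5 map
large = [[x * y for x in range(4)] for y in range(7)]       # 7 x 4 map
print(len(spp(small)), len(spp(large)))  # same length for both inputs
```

With levels (1, 2, 4) the output always has 1 + 4 + 16 = 21 values per channel, so the fully connected layers that follow never see a variable-sized input.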

drawbacks: Although SPPNet has effectively improved the detection speed, it still has some drawbacks: first, the training is still multi-stage; second, SPPNet only fine-tunes its fully connected layers and simply ignores all previous layers. In the following year, Fast RCNN [18] was proposed and solved these problems.

Fast RCNN

idea: Fast RCNN enables us to simultaneously train a detector and a bounding box regressor under the same network configuration. On the VOC07 dataset, Fast RCNN increased the mAP from 58.5% (RCNN) to 70.0%, with a detection speed over 200 times faster than R-CNN.
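
Training the classifier and the box regressor together is usually expressed as a single multi-task loss: a classification term plus a localization term that is applied only to non-background regions. The sketch below assumes that formulation, with a log loss and a smooth L1 box loss; the balancing weight `lam`, the convention that class 0 is background, and the function names are assumptions for illustration.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus box regression loss for foreground regions."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:  # assumed convention: class 0 is background, no box target
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return cls_loss + lam * loc_loss


loss_bg = multitask_loss([0.5, 0.5], 0, [0, 0, 0, 0], [1, 1, 1, 1])
loss_fg = multitask_loss([0.5, 0.5], 1, [0, 0, 0, 0], [0.5, 0, 0, 2.0])
print(loss_bg, loss_fg)
```

Because both terms are differentiable functions of the same network outputs, a single backward pass updates the shared layers for both tasks.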

drawbacks: Its detection speed is still limited by the proposal detection (see Section 2.3.2 for more details). A question then naturally arises: “can we generate object proposals with a CNN model?” Later, Faster R-CNN [19] answered this question.

Faster RCNN

idea: Faster RCNN is the first end-to-end and the first near-realtime deep learning detector (COCO mAP@.5=42.7%, COCO mAP@[.5,.95]=21.9%, VOC07 mAP=73.2%, VOC12 mAP=70.4%, 17fps with ZFNet [45]). The main contribution of Faster RCNN is the introduction of the Region Proposal Network (RPN), which enables nearly cost-free region proposals. From R-CNN to Faster RCNN, most individual blocks of an object detection system, e.g., proposal detection, feature extraction, and bounding box regression, have been gradually integrated into a unified, end-to-end learning framework.

drawbacks: Although Faster RCNN breaks through the speed bottleneck of Fast RCNN, there is still computational redundancy at the subsequent detection stage. Later, a variety of improvements were proposed, including RFCN [46] and Light head RCNN [47]. (See more details in Section 3.)

Feature Pyramid Networks(FPN)

In 2017, T.-Y. Lin et al. proposed Feature Pyramid Networks (FPN) [22] on the basis of Faster RCNN. Before FPN, most deep learning based detectors ran detection only on a network’s top layer. Although the features in deeper layers of a CNN are beneficial for category recognition, they are not conducive to localizing objects. To this end, a top-down architecture with lateral connections is developed in FPN for building high-level semantics at all scales. Since a CNN naturally forms a feature pyramid through its forward propagation, FPN shows great advances in detecting objects with a wide variety of scales. Using FPN in a basic Faster R-CNN system achieves state-of-the-art single-model detection results on the MSCOCO dataset without bells and whistles (COCO mAP@.5=59.1%, COCO mAP@[.5, .95]=36.2%). FPN has now become a basic building block of many of the latest detectors.
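
The top-down pathway with lateral connections can be sketched on toy single-channel maps: start from the coarsest map, upsample it, and add the next finer lateral map. This is a simplified illustration under assumptions: it uses nearest-neighbour 2x upsampling and plain addition, and it leaves out the convolutions a real FPN applies on the lateral and merged maps. The names `upsample2x` and `top_down` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down(laterals):
    """laterals: maps ordered finest first, each half the size of the previous one."""
    merged = [laterals[-1]]  # the coarsest level is used as is
    for lateral in reversed(laterals[:-1]):
        up = upsample2x(merged[0])
        merged.insert(0, [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)])
    return merged


c3 = [[1] * 4 for _ in range(4)]    # fine, weak semantics
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]                        # coarse, strong semantics
p3, p4, p5 = top_down([c3, c4, c5])
print(p3[0], p4[0], p5[0])
```

Every output level now carries a contribution from the coarsest map, which is the sense in which high-level semantics reach all scales.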

3、Milestones: CNN based One-stage Detectors

You Only Look Once (YOLO)

YOLO was proposed by J. Redmon et al. in 2015. It was the first one-stage detector in the deep learning era.

As its name suggests, the authors completely abandoned the previous detection paradigm of “proposal detection + verification”. Instead, YOLO follows a totally different philosophy: apply a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region simultaneously. Later, J. Redmon made a series of improvements on the basis of YOLO and proposed its v2 and v3 editions [48, 49], which further improve the detection accuracy while keeping a very high detection speed.
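
To make “divides the image into regions” concrete, the sketch below assigns a ground-truth box to the grid cell that contains its centre and encodes the box relative to that cell. The 7x7 grid, the encoding layout, and the function name are assumptions for illustration; the text above does not specify them.

```python
def encode_box(box, img_w, img_h, grid=7):
    """Return (row, col, x_offset, y_offset, w, h) for box=(x0, y0, x1, y1).

    Offsets are relative to the responsible cell; w and h are relative to the image.
    """
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2.0 / img_w * grid  # box centre in grid units
    cy = (y0 + y1) / 2.0 / img_h * grid
    col = min(int(cx), grid - 1)         # clamp centres lying on the far border
    row = min(int(cy), grid - 1)
    return row, col, cx - col, cy - row, (x1 - x0) / img_w, (y1 - y0) / img_h


print(encode_box((0, 0, 448, 448), 448, 448))
print(encode_box((400, 0, 448, 64), 448, 448))
```

Each cell then predicts boxes and class probabilities in one forward pass, so there is no separate proposal stage to run.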

Despite its great improvement in detection speed, YOLO suffers from a drop in localization accuracy compared with two-stage detectors, especially for small objects. YOLO’s subsequent versions [48, 49] and the later proposed SSD [21] have paid more attention to this problem.

Single Shot MultiBox Detector (SSD)

SSD [21] was proposed by W. Liu et al. in 2015. It was the second one-stage detector in the deep learning era.

The main contribution of SSD is the introduction of the multi-reference and multi-resolution detection techniques (to be introduced in Section 2.3.2), which significantly improve the detection accuracy of a one-stage detector, especially for small objects. SSD has advantages in terms of both detection speed and accuracy (VOC07 mAP=76.8%, VOC12 mAP=74.9%, COCO mAP@.5=46.5%, mAP@[.5,.95]=26.8%; a fast version runs at 59fps). The main difference between SSD and previous detectors is that SSD detects objects of different scales on different layers of the network, while the previous ones only run detection on their top layers.

RetinaNet

Despite their high speed and simplicity, one-stage detectors trailed the accuracy of two-stage detectors for years. T.-Y. Lin et al. identified the reasons behind this and proposed RetinaNet in 2017 [23]. They claimed that the extreme foreground-background class imbalance encountered during the training of dense detectors is the central cause. To this end, a new loss function named “focal loss” was introduced in RetinaNet, reshaping the standard cross entropy loss so that the detector puts more focus on hard, misclassified examples during training. Focal loss enables one-stage detectors to achieve accuracy comparable to two-stage detectors while maintaining a very high detection speed (COCO mAP@.5=59.1%, mAP@[.5, .95]=39.1%).
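
The reshaping of cross entropy can be written out for a single binary prediction. In the sketch below, `p` is the predicted probability of the foreground class and `y` is the label; the focusing parameter `gamma` and the balancing weight `alpha` use commonly cited default values, which are assumptions here since the text above does not give them.

```python
import math


def cross_entropy(p, y):
    """Standard binary cross entropy for one prediction."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross entropy scaled by (1 - p_t)^gamma so easy examples contribute less."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for p in (0.9, 0.5, 0.1):  # easy, ambiguous, and hard positive examples
    print(p, cross_entropy(p, 1), focal_loss(p, 1))
```

With `gamma=2`, a well-classified example at `p_t=0.9` keeps only 1% of its cross entropy (before the `alpha` weight), while a hard one at `p_t=0.1` keeps 81%, so the many easy background examples no longer dominate the total loss.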

4、Object Detection Datasets

Pascal VOC (PASCAL Visual Object Classes (VOC) Challenges, from 2005 to 2012)
The two commonly used versions are VOC07 and VOC12.
VOC07 consists of 5k training images + 12k annotated objects.
VOC12 consists of 11k training images + 27k annotated objects.
20 classes of objects that are common in life are annotated in these two datasets (Person: person; Animal: bird, cat, cow, dog, horse, sheep; Vehicle: aeroplane, bicycle, boat, bus, car, motor-bike, train; Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor).

ILSVRC (ImageNet Large Scale Visual Recognition Challenge)
ILSVRC was organized each year from 2010 to 2017.
The ILSVRC detection dataset contains 200 classes of visual objects.

MS-COCO
MS-COCO [53] is the most challenging object detection dataset available today. The annual competition based on the MS-COCO dataset has been held since 2015.
It has fewer object categories than ILSVRC, but more object instances.
For example, MS-COCO-17 contains 164k images and 897k annotated objects from 80 categories.
Apart from the bounding box annotations, each object is further labeled using per-instance segmentation to aid in precise localization.

Open Images (Open Images Detection (OID))
The dataset consists of 1,910k images with 15,440k annotated bounding boxes on 600 object categories.

Datasets of Other Detection Tasks

5、Object Detection Metrics

false positives per window (FPPW)
false positives per image (FPPI)
Average Precision (AP)
mean AP (mAP)
Intersection over Union (IoU)
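
Two of the metrics listed above are easy to state in code. The sketch below computes IoU for boxes given as (x0, y0, x1, y1) and an AP value as the area under a precision-recall curve whose precision has been made non-increasing. The box format, the interpolation choice, and the function names are assumptions for illustration; benchmarks differ in the exact AP protocol they use. mAP is then the mean of the per-class AP values.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(is_tp, n_gt):
    """is_tp: true-positive flags of detections sorted by descending score."""
    tp = fp = 0
    recalls, precisions = [], []
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(average_precision([True, False, True], n_gt=2))
```

A detection is typically counted as a true positive when its IoU with an unmatched ground-truth box exceeds a threshold such as 0.5, which is how the flags passed to `average_precision` would be produced.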

Key Excerpts

Multi-reference/-resolution detection (after 2015)

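Multi-reference detection is currently the most popular framework for multi-scale object detection [19, 21, 44, 48]. Its main idea is to pre-define a set of reference boxes (a.k.a. anchor boxes) with different sizes and aspect ratios at different locations of an image, and then predict the detection boxes based on these references.
Multi-resolution detection [21, 22, 55, 105] detects objects of different scales at different layers of the network. Since a CNN naturally forms a feature pyramid during its forward propagation, it is easier to detect larger objects in deeper layers and smaller ones in shallower layers.
Multi-reference and multi-resolution detection have now become two basic building blocks of state-of-the-art object detection systems.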
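
Both ideas can be illustrated with one anchor generator. The sketch places reference boxes of several sizes and aspect ratios at every feature-map location (multi-reference), and the usage lines call it once per layer with a different stride and box size (multi-resolution). The scales, ratios, strides, and the function name are assumptions chosen for the example, with the aspect ratio defined here as width divided by height.

```python
import math


def make_anchors(fmap_w, fmap_h, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x0, y0, x1, y1) reference boxes centred on every feature-map cell."""
    anchors = []
    for gy in range(fmap_h):
        for gx in range(fmap_w):
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride  # cell centre in pixels
            for s in scales:
                for r in ratios:
                    w = s * math.sqrt(r)  # area stays s * s for every ratio
                    h = s / math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Multi-resolution use: shallow layers (small stride) get small boxes,
# deep layers (large stride) get large boxes. Sizes assume a 256 x 256 input.
per_layer = {
    stride: make_anchors(256 // stride, 256 // stride, stride, scales=(size,))
    for stride, size in ((8, 32), (16, 64), (32, 128))
}
print({stride: len(boxes) for stride, boxes in per_layer.items()})
```

The detector then predicts, for each reference box, a class score and an offset that moves the box onto the object, rather than predicting box coordinates from scratch.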

English	汉语
epitome	典型;缩影
unprecedented	前所未有的
hierarchy	层次体系
in-depth	深入地
pros and cons	利弊
vector quantization	矢量量化
over-generalized	过于广义
intraclass	同类的
divide and conquer	分治法
decompose	分解
context priming	上下文启动
ensemble system	集成系统
To this end	为此目的
bells and whistles	华丽的点缀
abbreviation	缩写
paradigm	范式
focal	焦点的，中心的
unprecedented scale	规模空前