Scalable Object Detection using Deep Neural Networks
Authors: Dumitru Erhan, Christian Szegedy, Alexander Toshev, et al.
Published: 2013
Abstract
Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each object category in the image. Such a model captures the whole-image context around the objects but cannot handle multiple instances of the same object in the image without naively replicating the number of outputs for each instance. In this work, we propose a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest. The model naturally handles a variable number of instances for each class and allows for cross-class generalization at the highest levels of the network. We are able to obtain competitive recognition performance on VOC2007 and ILSVRC2012, while using only the top few predicted locations in each image and a small number of neural network evaluations.
1. Introduction
Object detection is one of the fundamental tasks in computer vision. A common paradigm to address this problem is to train object detectors which operate on a subimage and apply these detectors in an exhaustive manner across all locations and scales. This paradigm was successfully used within a discriminatively trained Deformable Part Model (DPM) to achieve state-of-the-art results on detection tasks [6].
The exhaustive search through all possible locations and scales poses a computational challenge. This challenge becomes even harder as the number of classes grows, since most of the approaches train a separate detector per class. In order to address this issue, a variety of methods were proposed, ranging from detector cascades to using segmentation to suggest a small number of object hypotheses [14, 2, 4].
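To give a rough sense of why exhaustive search is costly, the sketch below counts sliding-window positions over a small image pyramid and multiplies by the number of per-class detectors. The window size, stride, and scale set are illustrative assumptions, not values taken from the paper.

```python
def count_windows(width, height, window=64, stride=8, scales=(1.0, 0.75, 0.5)):
    """Count square sliding-window positions over a simple image pyramid.

    All default values here are hypothetical, chosen only for illustration.
    """
    total = 0
    for s in scales:
        w, h = int(width * s), int(height * s)
        if w < window or h < window:
            continue  # the window does not fit at this scale
        nx = (w - window) // stride + 1  # horizontal positions
        ny = (h - window) // stride + 1  # vertical positions
        total += nx * ny
    return total


def detector_evaluations(width, height, num_classes, **kwargs):
    """One detector per class means the cost grows linearly with the class count."""
    return count_windows(width, height, **kwargs) * num_classes


if __name__ == "__main__":
    print(detector_evaluations(640, 480, num_classes=20))
```

Even with this coarse grid, the count grows with the image area, the number of scales, and the number of classes, which is the scaling problem the paragraph describes.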
In this paper, we subscribe to the latter philosophy and propose to train a detector, called “DeepMultiBox”, which generates a few bounding boxes as object candidates. These boxes are generated by a single DNN in a class-agnostic manner. Our model has several contributions. First, we define object detection as a regression problem to the coordinates of several bounding boxes. In addition, for each predicted box the net outputs a confidence score of how likely this box is to contain an object. This is quite different from traditional approaches, which score features within predefined boxes, and has the advantage of expressing detection of objects in a very compact and efficient way.
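The regression formulation can be pictured as a network whose output vector is decoded into K boxes plus K confidences. The following is a minimal sketch under assumed conventions: the flat layout (4K coordinates followed by K confidence logits), the sigmoid on the logits, and the name `decode_predictions` are all hypothetical choices for illustration, not details stated in this section.

```python
import math


def decode_predictions(raw, k):
    """Split a flat output vector into k boxes and k confidence scores.

    Assumed layout (hypothetical): raw[0:4k] holds (x_min, y_min, x_max, y_max)
    for each box, and raw[4k:5k] holds one confidence logit per box.
    """
    if len(raw) != 5 * k:
        raise ValueError("expected 5 values per predicted box")
    boxes = [tuple(raw[4 * i:4 * i + 4]) for i in range(k)]
    # Squash each logit into (0, 1) so it reads as "how likely is an object here".
    confidences = [1.0 / (1.0 + math.exp(-z)) for z in raw[4 * k:]]
    return boxes, confidences


if __name__ == "__main__":
    raw_output = [0.1, 0.2, 0.5, 0.6, 0.3, 0.3, 0.9, 0.8, 0.0, 2.0]
    print(decode_predictions(raw_output, k=2))
```

The point of the sketch is the compactness: a fixed-size vector describes every candidate, with no class dimension involved.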
The second major contribution is the loss, which trains the bounding box predictors as part of the network training. For each training example, we solve an assignment problem between the current predictions and the groundtruth boxes and update the matched box coordinates, their confidences and the underlying features through backpropagation. In this way, we learn a deep net tailored towards our localization problem. We capitalize on the excellent representation learning abilities of DNNs, as recently exemplified in image classification [10] and object detection settings [13], and perform joint learning of representation and predictors.
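The assignment step can be illustrated with a small brute-force matcher. This is only a sketch: the cost used here (squared coordinate distance) and the exhaustive search over permutations are assumptions made for readability, and this section does not specify the actual matching cost or solver.

```python
from itertools import permutations


def box_distance(a, b):
    """Squared L2 distance between two boxes given as coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_boxes(predictions, ground_truth):
    """Assign one distinct prediction to each groundtruth box at minimal total cost.

    Returns (assignment, cost), where assignment[j] is the index of the
    prediction matched to ground_truth[j]. Brute force, so only suitable
    for a handful of boxes.
    """
    if len(ground_truth) > len(predictions):
        raise ValueError("need at least as many predictions as groundtruth boxes")
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(predictions)), len(ground_truth)):
        cost = sum(box_distance(predictions[i], g) for i, g in zip(perm, ground_truth))
        if cost < best_cost:
            best_cost, best = cost, list(perm)
    return best, best_cost


if __name__ == "__main__":
    preds = [(0.0, 0.0, 1.0, 1.0), (0.5, 0.5, 0.9, 0.9), (0.1, 0.1, 0.4, 0.4)]
    truth = [(0.1, 0.1, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
    print(match_boxes(preds, truth))
```

During training, only the matched predictions would receive a coordinate update toward their groundtruth box; unmatched predictions would be pushed toward low confidence.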
Finally, we train our object box predictor in a class-agnostic manner. We consider this as a scalable way to enable efficient detection of a large number of object classes. We show in our experiments that by only post-classifying fewer than ten boxes, obtained by a single network application, we can achieve state-of-the-art detection results. Further, we show that our box predictor generalizes over unseen classes and as such is flexible to be re-used within other detection problems.
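The post-classification idea can be sketched as a two-stage pipeline: keep the few highest-confidence class-agnostic boxes, then run a classifier on each. In the sketch below, `classify` is a hypothetical callable standing in for any image classifier, and the default `k=10` simply mirrors the "fewer than ten boxes" regime mentioned above.

```python
def top_k_boxes(boxes, confidences, k=10):
    """Return the k boxes with the highest confidence, best first."""
    order = sorted(range(len(boxes)), key=lambda i: confidences[i], reverse=True)
    return [boxes[i] for i in order[:k]]


def post_classify(boxes, confidences, classify, k=10):
    """Classify only the top-k class-agnostic boxes.

    `classify` is a hypothetical callable taking a box and returning
    a (label, score) pair; it is not an API from the paper.
    """
    results = []
    for box in top_k_boxes(boxes, confidences, k):
        label, score = classify(box)
        results.append((box, label, score))
    return results


if __name__ == "__main__":
    candidate_boxes = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
    scores = [0.2, 0.9, 0.6]
    print(post_classify(candidate_boxes, scores, lambda box: ("object", 1.0), k=2))
```

Because the box predictor is shared across classes, the classifier cost depends on k rather than on the number of classes times the number of windows.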
2. Previous work
The literature on object detection is vast, and in this section we will focus on approaches exploiting class-agnostic ideas and addressing scalability.
Many of the proposed detection approaches are based on part-based models [7], which more recently have achieved impressive performance thanks to discriminative learning and carefully crafted features [6]. These methods, however, rely on exhaustive application of part templates over multiple scales and as such are expensive. Moreover, they scale linearly in the number of classes, which becomes a challenge for modern datasets such as ImageNet.
To address the former issue, Lampert et al. [11] use a branch-and-bound strategy to avoid evaluating all potential object locations. To address the latter issue, Song et al. [12] use a low-dimensional part basis, shared across all object classes. A hashing-based approach for efficient part detection has shown good results as well [3].
A different line of work, closer to ours, is based on the idea that objects can be localized without having to know their class. Some of these approaches build on bottom-up classless segmentation [9]. The segments obtained in this way can be scored using top-down feedback [14, 2, 4]. Using the same motivation, Alexe et al. [1] use an inexpensive classifier to score object hypotheses for being an object or not, and in this way reduce the number of locations for the subsequent detection steps. These approaches can be thought of as multi-layered models, with segmentation as the first layer and segment classification as a subsequent layer. Despite the fact that they encode proven perceptual principles, we will show that having deeper models which are fully learned can lead to superior results.