Rich feature hierarchies for accurate object detection and semantic segmentation——Tech report (v5)用于精确物体定位和语义分割的丰富特征层次结构——技术报告(第5版)Ross Girshick Jeff Donahue Trevor Darrell Jitendra MalikUC Berkeley加州大学伯克利分校{frbg,jdonahue,trevor,malikg}@eecs.berkeley.edu |
Abstract
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks
(CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/˜rbg/rcnn.
摘要
过去几年,在经典数据集PASCAL上,物体检测的效果已经达到一个稳定水平。效果最好的方法是融合了多种低维图像特征和高维上下文环境的复杂集成系统。在这篇论文里,我们提出了一种简单并且可扩展的检测算法,可以在VOC2012最好结果的基础上将mAP值提高30%以上——达到了53.3%。我们的方法结合了两个关键的思想:(1)在候选区域上自下而上使用大型卷积神经网络(CNNs),用以定位和分割物体。(2)当带标签的训练数据不足时,先针对辅助任务进行有监督预训练,再进行特定任务的调优,就可以产生明显的性能提升。
因为我们把region proposal和CNNs结合起来,所以该方法被称为R-CNN:Regions with CNN features。我们也把R-CNN效果跟OverFeat比较了下,OverFeat是最近提出的基于类似的CNN特征并采用滑动窗口进行目标检测的一种方法,结果发现R-CNN在200个类别的ILSVRC2013检测数据集上的性能明显优于OVerFeat。系统完整的源代码见:http://www.cs.berkeley.edu/˜rbg/rcnn。(译者注:网址已失效,github新地址:https://github.com/rbgirshick/rcnn)
1. Introduction
Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [29] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [15], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.
1. 前言
特征的重要性。在过去十年,各类视觉识别任务基本都建立在对SIFT[29]和HOG[7]特征的使用。但如果我们关注一下PASCAL VOC对象检测[15]这个经典的视觉识别任务,就会发现2010-2012年进展缓慢,取得的微小进步都是通过构建一些集成系统和采用一些成功方法的变种才达到的。
SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.
SIFT和HOG是块方向直方图(blockwise orientation histograms),一种类似大脑初级皮层V1层复杂细胞的表示方法。但我们知道识别发生在多个下游阶段,(我们是先看到了一些特征,然后才意识到这是什么东西)也就是说对于视觉识别来说,更有价值的信息,是层次化的,多个阶段的特征。
Fukushima’s “neocognitron” [19], a biologically-inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on Rumelhart et al. [33], LeCun et al. [26] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.
Fukushima的“neocognitron,一种受生物学启发用于模式识别的层次化、移动不变性模型,算是这方面最早的尝试。然而,neocognitron缺乏监督训练算法。基于Rumelhart等人的研究,Lecun等人提出反向传播的随机梯度下降(SGD)对训练卷积神经网络(CNNs)非常有效,CNNs被认为是neocognitron的一种扩展。
CNNs saw heavy use in the 1990s (e.g., [27]), but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. [25] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun’s CNN (e.g., max(x; 0) rectifying non-linearities and “dropout” regularization).
CNNs在1990年代被广泛使用,但之后随着SVM的崛起便淡出研究主流。2012年,Krizhevsky等人在ImageNet大规模视觉识别挑战赛(ILSVRC)上的出色表现重新燃起了对CNNs的兴趣。他们的成功在于在120万标注的图像上使用了一个大型的CNN,并且对LeCUN的CNN进行了一些改造(比如ReLU和Dropout正则化)。
The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?
这个ImangeNet的结果的重要性在ILSVRC2012 workshop上得到了热烈的讨论。提炼出来的核心问题是:ImageNet上的CNN分类结果在何种程度上能够应用到PASCAL VOC挑战的物体检测任务上?
We answer this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.
我们通过弥合图像分类和目标检测差别,回答了这个问题。本论文是第一个说明在PASCAL VOC的物体检测任务上CNN比基于简单的类似HOG特征的系统有大幅的性能提升。我们主要关注了两个问题:使用深度网络定位目标和在小规模的标注数据集上进行大型网络模型的训练。
Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al. [38], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method). An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [32, 40] and pedestrians [35]. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195*195 pixels) and strides (32*32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.
与图像分类不同,目标检测需要定位图像中的目标(可能有多个)。一个方法是将框定位看做是回归问题。但Szegedy等人的研究以及我们自己的研究表明这种策略在实际应用中并不可行(在VOC2007上他们的mAP是30.5%,而我们的达到了58.5%)。另一种方法是使用滑动窗口检测器。通过这种方法使用CNNs至少已经有20年的时间了,通常用于一些特定目标种类的检测,例如人脸检测、行人检测等。为了获得较高的空间分辨率,这些CNNs普遍采用了两个卷积层和两个池化层。我们本来也考虑过使用滑动窗口的方法。但是由于我们网络有5个卷积层,具有更深的层和更多的神经元,使得输入图片有非常大的感受野(195×195)和步长(32×32),这使得采用滑动窗口的精确定位方法充满挑战。
Instead, we solve the CNN localization problem by operating within the “recognition using regions” paradigm [21], which has been successful for both object detection [39] and semantic segmentation [5]. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.
为了解决CNN的定位,我们是通过操作”recognition using regions”范式,这种方法已经成功用于目标检测和语义分隔。测试时,每张图片产生了接近2000个与类别无关的region proposal,然后分别通过CNN提取了一个固定长度的特征向量,最后使用特定类别的线性SVM对每个region进行分类。不论region的形状,我们使用一种简单的方法(仿射图像变形)将每个region proposal转换成固定尺寸的大小作为CNN的输入。图1展示了我们方法的全貌并突出展示了一些实验结果。由于我们的模型结合了Region proposals和CNNs,所以把这种方法称为R-CNN,即Regions with CNN features。
Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [39] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%. On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat [34], which had the previous best result at 24.3%.
图1:物体检测系统概述。我们的系统(1)输入一张图像、(2)提取大约2000个自下而上的region proposals、(3)使用大型卷积神经网络(CNN)计算每个region proposals的特征向量、(4)使用特定类别的线性SVM对每个region进行分类。R-CNN在PASCAL VOC 2010上的mAP为53.7%。对比[39]文献中使用相同region proposals方法,并用使用采用空间金字塔和bag-of-visual-words方法的模型,其mAP只有35.1%。流行的可变部件的模型(DPM)的性能也只有33.4%。在200个类别的ILSVRC2013检测数据集上,R-CNN的mAP为31.4%,比最佳结果24.3%OverFeat有了很大改进。
In this updated version of this paper, we provide a head-to-head comparison of R-CNN and the recently proposed OverFeat [34] detection system by running R-CNN on the 200-class ILSVRC2013 detection dataset. OverFeat uses a sliding-window CNN for detection and until now was the best performing method on ILSVRC2013 detection. We show that R-CNN significantly outperforms OverFeat, with a mAP of 31.4% versus 24.3%.
本篇论文的更新版本中,我们提供了R-CNN和最近提出的OverFeat检测系统在ILSVRC2013的200分类检测数据集上对比结果。OverFeat使用了滑动窗口CNN做检测,目前为止是ILSVRC2013检测集上表现最好的方法。我们的结果显示,R-CNN的mAP达到31.4%,显著超越了OverFeat的24.3%的结果。
A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning (e.g., [35]). The second principle contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce. In our experiments, fine-tuning for detection improves mAP performance by 8 percentage points. After fine-tuning, our system achieves a mAP of 54% on VOC 2010 compared to 33% for the highly-tuned, HOG-based deformable part model (DPM) [17, 20]. We also point readers to contemporaneous work by Donahue et al. [12], who show that Krizhevsky’s CNN can be used (without finetuning) as a blackbox feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation.
检测任务第二个挑战是标签数据太少,手头可用于训练一个大型卷积网络的数据往往很不充足。传统解决方法通常采用无监督预训练,再进行有监督调优的方法(如[35])。本文的第二个核心贡献是在辅助数据集(ILSVRC)上进行有监督预训练,再在小数据集(PASCAL)上针对特定问题进行调优,这种方法在训练数据稀少的情况下训练大型卷积神经网络是非常有效的。我们的实验中,针对检测的调优将mAP提高了8个百分点。调优后,我们的系统在VOC2010上达到了54%的mAP,远远超过高度优化的基于HOG的可变性部件模型(deformable part model,DPM)[17, 20]。另外提醒读者朋友们关注Donahue等人同时期的工作,他们的研究表明Krizhevsky的CNN(译者注:Krizhevsky的CNN指AlexNet网络)可以用来作为一个黑盒特征提取器,(没有调优的情况下)在多个识别任务上包括场景分类、细粒度的子分类和领域自适应(domain adaptation)方面均表现很出色。
Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. This computational property follows from features that are shared across all categories and that are also two orders of magnitude lower-dimensional than previously used region features (cf. [39]).
同时,我们的模型系统也很高效。都是相当小型的矩阵向量相乘和贪婪的非极大值抑制这些特定类别的计算。这个计算特性源自于所有类别都共享的特征,同时这些特征比之前使用的区域特征([39])少了两个数量级的维度。
Understanding the failure modes of our approach is also critical for improving it, and so we report results from the detection analysis tool of Hoiem et al. [23]. As an immediate consequence of this analysis, we demonstrate that a simple bounding-box regression method significantly reduces mislocalizations, which are the dominant error mode.
分析我们方法的失败案例对进一步改进和提升很有帮助,所以我们借助Hoiem等人的定位分析工具[23]做实验结果的报告和分析。作为本次分析的直接结果,我们发现一个简单的边界框回归的方法会明显地降低错误定位问题,而错误定位是我们的模型系统的主要误差。
Before developing technical details, we note that because R-CNN operates on regions it is natural to extend it to the task of semantic segmentation. With minor modifications, we also achieve competitive results on the PASCAL VOC segmentation task, with an average segmentation accuracy of 47.9% on the VOC 2011 test set.
开发技术细节之前,我们注意到由于R-CNN是在推荐区域上进行操作,所以可以很自然地扩展到语义分割任务上。经过微小的改动,我们就在PASCAL VOC语义分割任务上达到了很有竞争力的结果,在VOC 2011测试集上平均语义分割精度达到了47.9%。
2. Object detection with R-CNN
Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of classs-pecific linear SVMs. In this section, we present our design decisions for each module, describe their test-time usage, detail how their parameters are learned, and show detection results on PASCAL VOC 2010-12 and on ILSVRC2013.
2. 使用R-CNN做物体检测
我们的物体检测系统有三个模块构成。第一个模块产生类别无关的region proposals。这些proposals组成了一个模型可用的候选检测区域的集合。第二个模块是一个大型卷积神经网络,从每个region提取固定长度的特征向量。第三个模块是特定类别线性SVM的集合。这一节将展示每个模块的设计,并介绍它们的测试阶段的用法,以及一些参数学习的细节,并得出在PASCAL VOC 2010-12和ILSVRC2013上的检测结果。
2.1. Module design
Region proposals. A variety of recent papers offer methods for generating category-independent region proposals. Examples include: objectness [1], selective search [39], category-independent object proposals [14], constrained parametric min-cuts (CPMC) [5], multi-scale combinatorial grouping [3], and Cires¸an et al. [6], who detect mitotic cells by applying a CNN to regularly-spaced square crops, which are a special case of region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [39, 41]).
2.1 模块设计
区域推荐(Region Proposals)。近来有很多研究都提出了生成类别无关的区域推荐的方法。比如objectness [1]、selective search [39]、category-independent object proposals [14]、constrained parametric min-cuts (CPMC) [5]、multi-scale combinatorial grouping [3],以及Ciresan等人提出的将CNN用在规律空间块裁剪上以检测有丝分裂细胞的方法,也算是一种特殊的区域推荐类型。由于R-CNN对特定区域算法是不关心的,所以我们采用了selective search以方便和之前的工作[39, 41]进行可控的比较。
Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [24] implementation of the CNN described by Krizhevsky et al. [25]. Features are computed by forward propagating a mean-subtracted 227*227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [24, 25] for more network architecture details.
特征提取。我们使用Krizhevsky等人[25]所描述的CNN(译者注:AlexNet)的一个Caffe[24]实现版本对每个推荐区域提取一个4096维度的特征向量。减去像素均值的277×277大小的RGB输入图像通过五个卷积层和两个全连接层,最终计算得到特征向量。读者可以参考[24, 25]获得更多的网络架构细节。
In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227*22