Rich feature hierarchies for accurate object detection and semantic segmentation

(Click the title to view the original paper: https://arxiv.org/abs/1311.2524)

↓ Supplementary reading to aid understanding ↓

r-cnn-ilsvrc2013-workshop.pdf

Abstract

In object detection, the best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features.
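The second insight (pre-train on a data-rich auxiliary task, then fine-tune on the scarce target task) can be illustrated with a toy stand-in: a logistic-regression "network" pre-trained on a large synthetic dataset, whose weights then seed training on a small related one. All data, model, and step-count choices below are illustrative, not the paper's actual ImageNet/detection setup.

```python
import numpy as np

# Toy illustration of "supervised pre-training + fine-tuning".
# The linear logistic model and synthetic data are stand-ins for the
# paper's ImageNet pre-training and detection fine-tuning.

rng = np.random.default_rng(0)

def train(W, X, y, lr=0.1, steps=200):
    """Logistic-regression training by plain gradient descent."""
    for _ in range(steps):
        z = np.clip(X @ W, -30, 30)          # avoid exp overflow
        p = 1.0 / (1.0 + np.exp(-z))         # sigmoid predictions
        W = W - lr * X.T @ (p - y) / len(y)  # gradient step on log-loss
    return W

# 1) Pre-train on a large auxiliary dataset (analogue of ImageNet).
w_true = rng.normal(size=20)
X_aux = rng.normal(size=(1000, 20))
y_aux = (X_aux @ w_true > 0).astype(float)
W_pre = train(np.zeros(20), X_aux, y_aux)

# 2) Fine-tune on a small dataset from a related, shifted task
#    (analogue of scarce annotated detection data), starting from W_pre.
w_shift = w_true + 0.3 * rng.normal(size=20)
X_small = rng.normal(size=(30, 20))
y_small = (X_small @ w_shift > 0).astype(float)
W_ft = train(W_pre.copy(), X_small, y_small, steps=50)

# Baseline: train the small task from scratch with the same budget.
W_scratch = train(np.zeros(20), X_small, y_small, steps=50)
```

The point of the sketch is only the training schedule: the fine-tuned model starts from weights shaped by the auxiliary task rather than from zero, which is what makes a small labeled set usable.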

1. Introduction

Features matter. Recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

The neocognitron, however, lacked a supervised training algorithm. Stochastic gradient descent via backpropagation later proved effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.

The central issue can be distilled to the following: to what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

We answer this question by bridging the gap between image classification and object detection. We focused on two problems: localizing objects with a deep network, and training a high-capacity model with only a small quantity of annotated detection data.

Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. An alternative is to build a sliding-window detector. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.
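The 195 × 195 receptive field and 32-pixel stride quoted above follow from layer-by-layer arithmetic: each layer adds (kernel − 1) × (cumulative input stride) to the receptive field and multiplies the cumulative stride by its own stride. A small sketch, assuming the standard AlexNet-style kernel/stride values for the five convolutional and three pooling layers:

```python
# Receptive-field arithmetic for an AlexNet-like stack.
# Layer parameters are the standard AlexNet values (an assumption here);
# the paper reports 195x195 / 32 pixels for the top (pool5) units.

def receptive_field(layers):
    """Receptive field and stride of the top unit w.r.t. the input.

    Each layer is (kernel_size, stride); rf grows by (k - 1) times the
    cumulative stride of the feature map feeding that layer.
    """
    rf, stride = 1, 1
    for k, s in layers:
        rf += (k - 1) * stride
        stride *= s
    return rf, stride

# conv1, pool1, conv2, pool2, conv3, conv4, conv5, pool5
alexnet = [(11, 4), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1), (3, 2)]
rf, stride = receptive_field(alexnet)
print(rf, stride)  # 195 32
```

With a 195-pixel window moving 32 pixels at a time, a sliding-window scheme cannot place tight boxes around small objects, which is exactly the localization difficulty the paragraph describes.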

Instead, we solve the CNN localization problem by operating within the “recognition using regions” paradigm [21], which has been successful for both object detection [39] and semantic segmentation [5]. Our method extracts a fixed-length feature vector from each proposal using a CNN and then classifies each region with category-specific linear SVMs. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.

Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [39] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%. On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat [34], which had the previous best result at 24.3%.
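The figure's four-step pipeline can be sketched end to end. Every component below is a stand-in (random boxes instead of selective search, a flatten-and-normalize stub instead of the CNN, zero-weight SVMs); the sketch only shows the data flow from an image to per-class region scores.

```python
import numpy as np

# Illustrative sketch of the R-CNN data flow; all components are
# stand-ins, not the actual selective-search / AlexNet / SVM code.

def propose_regions(image, n=2000):
    """Stand-in for selective search: random boxes (x1, y1, x2, y2)."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    x1 = rng.integers(0, w - 1, n); y1 = rng.integers(0, h - 1, n)
    x2 = rng.integers(x1 + 1, w);   y2 = rng.integers(y1 + 1, h)
    return np.stack([x1, y1, x2, y2], axis=1)

def extract_features(image, box, dim=4096):
    """Stand-in for the CNN: crop, flatten to a fixed length, normalize.

    (Real R-CNN warps each crop to 227x227 and takes a CNN activation.)
    """
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2].astype(np.float64)
    v = np.resize(crop.ravel(), dim)
    return v / (np.linalg.norm(v) + 1e-8)

def score_regions(feats, svm_W, svm_b):
    """Per-class linear SVM scores: one score per (region, class)."""
    return feats @ svm_W.T + svm_b

image = np.random.default_rng(1).integers(0, 255, (300, 400, 3))
boxes = propose_regions(image, n=20)                     # step 2
feats = np.stack([extract_features(image, b) for b in boxes])  # step 3
W, b = np.zeros((21, 4096)), np.zeros(21)  # 20 VOC classes + background
scores = score_regions(feats, W, b)                      # step 4
print(scores.shape)  # (20, 21): regions x classes
```

In the real system each stage is heavyweight (≈2000 proposals, a 5-conv-layer network, one trained SVM per class), but the interfaces between the stages are exactly these: boxes in, fixed-length vectors out, linear scores last.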

A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN.
