Integrated Recognition, Localization and Detection Using ConvNets

baidu88vip

License: CC 4.0 BY-SA

Categories: Deep Learning, Computer Vision. Tags: convolutional networks, computer vision

Article link: https://blog.youkuaiyun.com/baidu88vip/article/details/81045694

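This article presents an integrated framework that uses convolutional networks (ConvNets) for classification, localization and detection; the framework won the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013. By learning to predict object boundaries, the method improves localization accuracy and also performs well on the classification and detection tasks. The article covers model design, training, feature extraction and multi-scale classification, as well as the localization and detection strategies, including the regression network and the accumulation of predicted bounding boxes.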

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

(Original paper: https://arxiv.org/abs/1312.6229)

Abstract

We present an integrated framework for using Convolutional Networks for classification, localization and detection.

We also introduce a novel deep learning approach to localization by learning to predict object boundaries.

This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classification tasks.

Finally, we release a feature extractor from our best model called OverFeat.

1. Introduction

Recognizing the category of the dominant object in an image is a task to which Convolutional Networks (ConvNets) [17] have been applied for many years.

The main advantage of ConvNets for many such tasks is that the entire system is trained end to end, from raw pixels to ultimate categories, thereby alleviating the requirement to manually design a suitable feature extractor. The main disadvantage is their ravenous appetite for labeled training samples.

The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. We also introduce a novel method for localization and detection by accumulating predicted bounding boxes.

Ways to handle variation in object size and position: the first idea is to apply a ConvNet at multiple locations in the image, in a sliding-window fashion, and over multiple scales. This leads to decent classification but poor localization and detection. The second idea is to train the system to produce not only a distribution over categories for each window, but also a prediction of the location and size of the bounding box containing the object, relative to the window. The third idea is to accumulate the evidence for each category at each location and size.
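The first two ideas can be sketched in a few lines of Python. This is a naive illustration only, not the paper's implementation: the function names, the fixed square window, the stride, and the convention that a predicted box is given as fractions of the window are all assumptions made for this example.

```python
from typing import Iterator, List, Sequence, Tuple

# (x_min, y_min, x_max, y_max); the unit depends on context (see below).
Box = Tuple[float, float, float, float]


def sliding_windows(width: int, height: int, window: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield the top-left corner of every square window that fits inside the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield x, y


def multiscale_windows(width: int, height: int, window: int, stride: int,
                       scales: Sequence[float]) -> List[Tuple[float, int, int]]:
    """Enumerate (scale, x, y) for every window over every rescaled copy of the image."""
    positions = []
    for scale in scales:
        scaled_w = int(round(width * scale))
        scaled_h = int(round(height * scale))
        for x, y in sliding_windows(scaled_w, scaled_h, window, stride):
            positions.append((scale, x, y))
    return positions


def window_box_to_image(rel_box: Box, x: int, y: int, window: int, scale: float) -> Box:
    """Map a box expressed as fractions of the window (assumed convention)
    back to pixel coordinates in the original, unscaled image."""
    rx0, ry0, rx1, ry1 = rel_box
    return (
        (x + rx0 * window) / scale,
        (y + ry0 * window) / scale,
        (x + rx1 * window) / scale,
        (y + ry1 * window) / scale,
    )
```

Each `(scale, x, y)` entry is one place where the classifier and the box regressor would be evaluated; mapping every window-relative prediction into a common image coordinate frame is what makes it possible to accumulate evidence across locations and scales afterwards.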

Several authors have also proposed to train ConvNets to directly predict the instantiation parameters of the objects to be located. Hinton et al. have also proposed to train networks to compute explicit instantiation parameters of features as part of a recognition process [12].

Other authors have proposed to perform object localization via ConvNet-based segmentation. The simplest approach consists in training the ConvNet to classify the central pixel (or voxel for volumetric images) of its viewing window as a boundary between regions or not [13]. For semantic segmentation, the main idea is to train the ConvNet to classify the central pixel of the viewing window with the category of the object it belongs to, using the window as context for the decision. The advantage of this approach is that the bounding contours need not be rectangles, and the regions need not be well-circumscribed objects. The disadvantage is that it requires dense pixel-level labels for training.
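As a rough sketch of central-pixel classification, the snippet below cuts a window around every pixel and asks a classifier for the label of the window's center. The names `centered_window`, `label_map` and `classify_center`, the zero padding at the image border, and the odd window size are assumptions for illustration; `classify_center` is a hypothetical stand-in for a trained ConvNet.

```python
from typing import Callable, List, TypeVar

Image = List[List[int]]
Label = TypeVar("Label")


def centered_window(image: Image, row: int, col: int, size: int, pad_value: int = 0) -> Image:
    """Return the size x size window centered on (row, col).

    Positions that fall outside the image are filled with pad_value.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so that the window has a central pixel")
    half = size // 2
    height, width = len(image), len(image[0])
    return [
        [
            image[r][c] if 0 <= r < height and 0 <= c < width else pad_value
            for c in range(col - half, col + half + 1)
        ]
        for r in range(row - half, row + half + 1)
    ]


def label_map(image: Image, classify_center: Callable[[Image], Label], size: int) -> List[List[Label]]:
    """Label every pixel by classifying the window centered on it."""
    return [
        [classify_center(centered_window(image, r, c, size)) for c in range(len(image[0]))]
        for r in range(len(image))
    ]
```

Because every pixel receives its own label, the resulting regions can take any shape, which is the advantage noted above; the same property is why training needs a label for every pixel.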

2. Vision Tasks

In this paper, we explore three computer vision tasks in increasing order of difficulty: (i) classification, (ii) localization, and (iii) detection. Each task is a sub-task of the next.

Classification task: each image is assigned a single label corresponding to the main object in the image. Five guesses are allowed to find the correct answer (this is because images can also contain multiple unlabeled objects).
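The five-guess rule can be made concrete with a small sketch of a top-5 error computation. The function names and the dictionary-of-scores input format are assumptions for this example, not part of the challenge's official tooling.

```python
from typing import Dict, List, Sequence


def top_k_labels(scores: Dict[str, float], k: int = 5) -> List[str]:
    """Return the k labels with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]


def top_k_error(predictions: Sequence[Dict[str, float]], truths: Sequence[str], k: int = 5) -> float:
    """Fraction of images whose true label is not among the k best guesses."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("at least one image is required")
    misses = sum(
        1 for scores, truth in zip(predictions, truths)
        if truth not in top_k_labels(scores, k)
    )
    return misses / len(truths)
```

An image counts as correct when its single ground-truth label appears anywhere among the five highest-scoring guesses, so a prediction that ranks another object in the image first is not penalized as long as the labeled object is still in the top five.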