An overview of object detection: one-stage methods
(Original post: https://www.jeremyjordan.me/object-detection-one-stage/)
Object detection is useful for understanding what’s in an image, describing both what objects are present and where those objects are found.
In general, there are two different approaches to this task:
- make a fixed number of predictions on a grid (one stage), or
- leverage a proposal network to find objects and then use a second network to fine-tune these proposals and output a final prediction (two stage).
Each approach has its own strengths and weaknesses.
Understanding the task
The goal of object detection is to recognize instances of a predefined set of object classes (e.g. {people, cars, bikes, animals}) and describe the locations of each detected object in the image using a bounding box.
We’ll use rectangles to describe the locations of each object. An alternative approach would be image segmentation, which provides localization at the pixel level.
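For concreteness, a single detection can be represented as a class label plus the four coordinates of an axis-aligned rectangle. A minimal sketch in Python (the field names here are illustrative, not something prescribed by the post):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: a class label and an axis-aligned bounding box."""
    class_name: str   # e.g. "person", "car", "bike"
    x_min: float      # left edge, in pixels
    y_min: float      # top edge, in pixels
    x_max: float      # right edge, in pixels
    y_max: float      # bottom edge, in pixels

# Example: a person occupying the region from (30, 40) to (180, 210) of the image.
person = Detection("person", x_min=30, y_min=40, x_max=180, y_max=210)
```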
Direct object prediction
This blog post will focus on model architectures which directly predict object bounding boxes for an image in a one-stage fashion.
Predictions on a grid
We’ll refer to this part of the architecture as the “backbone” network, which is usually pre-trained as an image classifier to more cheaply learn how to extract features from an image.
After pre-training the backbone architecture as an image classifier, we’ll remove the last few layers of the network so that our backbone network outputs a collection of stacked feature maps which describe the original image at a low spatial resolution, albeit a high feature (channel) resolution.
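As a rough sketch of this step (using a VGG-16 backbone from torchvision as an assumption; the post doesn’t prescribe a specific classifier), truncating the classification head leaves a 7x7 grid of 512-channel feature vectors for a 224x224 input:

```python
import torch
import torchvision

# Load a pre-trained image classifier to reuse as the backbone.
# VGG-16 is an assumption made for this sketch.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")

# Keep only the convolutional feature extractor; this drops the fully connected
# classification head (the "last few layers" mentioned above).
backbone = vgg.features

image = torch.randn(1, 3, 224, 224)   # one 224x224 RGB image
features = backbone(image)
print(features.shape)                  # torch.Size([1, 512, 7, 7])
```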
Coarse spatial representation with rich feature description of the original image
We can relate this 7x7 grid back to the original input in order to understand what each grid cell represents relative to the original image.
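As a small illustration (assuming a 224x224 input mapped onto a 7x7 grid, so each cell covers a 32x32 pixel region; the helper name is hypothetical):

```python
def cell_to_pixels(row, col, image_size=224, grid_size=7):
    """Return the pixel region (x_min, y_min, x_max, y_max) covered by a grid cell."""
    cell_size = image_size / grid_size   # 32 pixels per cell in this example
    return (col * cell_size, row * cell_size,
            (col + 1) * cell_size, (row + 1) * cell_size)

print(cell_to_pixels(0, 0))   # (0.0, 0.0, 32.0, 32.0): top-left cell
print(cell_to_pixels(3, 3))   # (96.0, 96.0, 128.0, 128.0): a central cell
```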
We can also determine roughly where objects are located in the coarse (7x7) feature maps by observing which grid cell contains the center of our bounding box annotation. We’ll assign this grid cell as being “responsible” for detecting that specific object.
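A minimal sketch of that assignment (again assuming a 7x7 grid over a 224x224 image; the function name is illustrative):

```python
def responsible_cell(x_min, y_min, x_max, y_max, image_size=224, grid_size=7):
    """Return the (row, col) of the grid cell containing the box center."""
    cell_size = image_size / grid_size
    x_center = (x_min + x_max) / 2
    y_center = (y_min + y_max) / 2
    return int(y_center // cell_size), int(x_center // cell_size)

# The cell containing the center of the example box from (30, 40) to (180, 210):
print(responsible_cell(30, 40, 180, 210))   # (3, 3) -> this cell is "responsible"
```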
In order to detect this object, we will add another convolutional layer and learn the kernel parameters which combine the context of all 512 feature maps in order to produce an activation corresponding to the grid cell which contains our object.
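One hedged way to picture this layer (a 1x1 kernel is an assumption of this sketch; the post only calls for “another convolutional layer”): a 1x1 convolution mixes all 512 feature maps at each grid location into a single “objectness” activation.

```python
import torch
import torch.nn as nn

# One objectness score per cell of the 7x7 grid.
objectness_head = nn.Conv2d(in_channels=512, out_channels=1, kernel_size=1)

features = torch.randn(1, 512, 7, 7)   # backbone output, as in the sketch above
objectness = torch.sigmoid(objectness_head(features))
print(objectness.shape)                 # torch.Size([1, 1, 7, 7])
```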
If the input image contains multiple objects, we should have multiple activations on our grid, denoting that an object is in each of the activated regions.
However, we cannot sufficiently describe each object with a single activation. In order to fully describe a detected object, we’ll need to define:
- The likelihood that a grid cell contains an object (\( p_{obj} \))