An overview of object detection: one-stage methods

This post surveys one-stage methods for object detection: direct object prediction on a grid, a framing of the task itself, detailed walkthroughs of the YOLO and SSD models, and how the focal loss addresses class imbalance. It also covers the advantage of one-stage methods in making a fixed number of predictions over an image, and the importance of non-max suppression.



(Original article: https://www.jeremyjordan.me/object-detection-one-stage/)


Object detection is useful for understanding what’s in an image, describing both what is in an image and where those objects are found.


In general, there are two different approaches for this task – we can either make a fixed number of predictions on a grid (one stage) or leverage a proposal network to find objects and then use a second network to fine-tune these proposals and output a final prediction (two stage).


Each approach has its own strengths and weaknesses.

Understanding the task

The goal of object detection is to recognize instances of a predefined set of object classes (e.g. {people, cars, bikes, animals}) and describe the locations of each detected object in the image using a bounding box.


We’ll use rectangles to describe the locations of each object. An alternative approach would be image segmentation which provides localization at the pixel-level.

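Since rectangles are the unit of localization here, it helps to pin down a concrete representation. A common convention (an assumption of this sketch, not something the post prescribes) is corner coordinates `(x_min, y_min, x_max, y_max)`, with overlap between two boxes measured by intersection-over-union (IoU):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes yield no intersection area
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

IoU is the standard overlap measure for comparing two boxes – it is, for example, what non-max suppression uses to decide whether two detections refer to the same object.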

Direct object prediction

This blog post will focus on model architectures which directly predict object bounding boxes for an image in a one-stage fashion.


Predictions on a grid

We’ll refer to this part of the architecture as the “backbone” network, which is usually pre-trained as an image classifier to more cheaply learn how to extract features from an image.


After pre-training the backbone architecture as an image classifier, we’ll remove the last few layers of the network so that our backbone network outputs a collection of stacked feature maps which describe the original image in a low spatial resolution albeit a high feature (channel) resolution.


Coarse spatial representation with rich feature description of original image


We can relate this 7x7 grid back to the original input in order to understand what each grid cell represents relative to the original image.

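As a rough sketch of this correspondence (assuming a square 224×224 input and uniform downsampling, so each of the 7×7 cells maps back to a 32×32 pixel region; the helper name is illustrative):

```python
def cell_to_region(row, col, image_size=224, grid_size=7):
    """Map a (row, col) cell of a grid_size x grid_size feature map back to
    the pixel region (x_min, y_min, x_max, y_max) of the original square image."""
    stride = image_size // grid_size  # input pixels covered per cell (224 // 7 = 32)
    return (col * stride, row * stride, (col + 1) * stride, (row + 1) * stride)

print(cell_to_region(0, 0))  # top-left cell -> (0, 0, 32, 32)
print(cell_to_region(6, 6))  # bottom-right cell -> (192, 192, 224, 224)
```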

We can also determine roughly where objects are located in the coarse (7x7) feature maps by observing which grid cell contains the center of our bounding box annotation. We’ll assign this grid cell as being “responsible” for detecting that specific object.

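The "responsible" cell falls out of dividing the box center coordinates by the cell stride. A minimal sketch, again assuming a 224×224 input and a 7×7 grid (`responsible_cell` is an illustrative helper, not from the post):

```python
def responsible_cell(box, image_size=224, grid_size=7):
    """Return the (row, col) of the grid cell containing the center of a
    bounding box given as (x_min, y_min, x_max, y_max) in image pixels."""
    x_center = (box[0] + box[2]) / 2
    y_center = (box[1] + box[3]) / 2
    stride = image_size / grid_size
    # min() keeps a center lying exactly on the right/bottom image edge in the grid
    col = min(int(x_center // stride), grid_size - 1)
    row = min(int(y_center // stride), grid_size - 1)
    return row, col

print(responsible_cell((96, 96, 128, 128)))  # center (112, 112) -> cell (3, 3)
```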

In order to detect this object, we will add another convolutional layer and learn the kernel parameters which combine the context of all 512 feature maps in order to produce an activation corresponding with the grid cell which contains our object.

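Reduced to its core, such a layer computes, at every grid cell, a weighted combination of all 512 channel values. The NumPy sketch below uses the simplest case, a 1×1 kernel with random stand-in weights (a trained layer would have learned these, and may use a larger kernel to pull in spatial context):

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((7, 7, 512))  # backbone output: 7x7 grid, 512 channels
kernel = rng.standard_normal(512)                # 1x1 conv kernel (random stand-in for learned weights)

# At each of the 49 grid cells, take a weighted sum across all 512 channels;
# after training, the cell containing the object would produce a high activation.
activation_grid = feature_maps @ kernel          # shape (7, 7)
print(activation_grid.shape)
```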

If the input image contains multiple objects, we should have multiple activations on our grid denoting that an object is in each of the activated regions.


However, we cannot sufficiently describe each object with a single activation. In order to fully describe a detected object, we’ll need to define:


  • The likelihood that a grid cell contains an object \( (p_{obj}) \)
  • Which class the object belongs to \( (C_1, C_2, ..., C_n) \)
  • Four bounding box descriptors for the x coordinate, y coordinate, width, and height of a labeled box \( (t_x, t_y, t_w, t_h) \)
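Putting the list above together, each grid cell therefore predicts a small vector of values. A quick count, assuming YOLO-style outputs of one objectness score, four box descriptors, and one score per class for each predicted box (the helper is illustrative):

```python
def outputs_per_cell(num_classes, boxes_per_cell=1):
    """Values each grid cell predicts: per box, 1 objectness score (p_obj)
    + 4 box descriptors (t_x, t_y, t_w, t_h) + num_classes class scores."""
    return boxes_per_cell * (1 + 4 + num_classes)

print(outputs_per_cell(20))     # 20 classes, one box per cell -> 25 values
print(outputs_per_cell(20, 2))  # two boxes per cell -> 50 values
```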