Reading Note: DetNet: A Backbone network for Object Detection

This paper proposes DetNet, a novel backbone network optimized for the object detection task. It addresses the inherent problems of traditional pre-trained classification models on detection by maintaining high spatial resolution while enlarging the receptive field. DetNet is built on ResNet-50: the first four stages are kept unchanged, extra stages are introduced with dilated convolutions, and the spatial resolution of the feature maps is fixed at 16x downsampling.

TITLE: DetNet: A Backbone network for Object Detection

AUTHOR: Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, Jian Sun

ASSOCIATION: Tsinghua University, Face++

FROM: arXiv:1804.06215

CONTRIBUTION

  1. The inherent drawbacks of the traditional ImageNet pre-trained model for fine-tuning recent object detectors are analyzed.
  2. A novel backbone, called DetNet, is proposed, which is specifically designed for object detection task by maintaining the spatial resolution and enlarging the receptive field.

METHOD

Motivation

There are two problems with using a classification backbone for object detection tasks. (i) Recent detectors, e.g., FPN, involve extra stages compared with the backbone network for ImageNet classification in order to detect objects of various sizes. (ii) A traditional backbone achieves a larger receptive field through a large downsampling factor, which benefits visual classification. However, the spatial resolution is compromised, which hurts accurate localization of large objects and recognition of small objects.

To summarize, there are three main problems with using current pre-trained models:

  1. The number of network stages is different. The extra layers used for object detection compared with classification have not been pre-trained.
  2. Weak visibility of large objects. The feature maps with strong semantic information have a large stride with respect to the input image, which is harmful for object localization.
  3. Invisibility of small objects. Information from small objects is easily weakened as the spatial resolution of the feature maps decreases and large context information is integrated.

To address these problems, DetNet has the following characteristics. (i) The number of stages is directly designed for object detection. (ii) Even though more stages are involved, a high spatial resolution of the feature maps is maintained, while a large receptive field is kept using dilated convolution.
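Why dilation preserves resolution while enlarging the receptive field can be seen from simple arithmetic: a k×k convolution with dilation d covers the same extent as a kernel of size k + (k-1)(d-1), without any downsampling. A minimal illustration (the function name is ours, not from the paper):

```python
def effective_kernel(k, dilation):
    """Effective kernel extent of a k x k convolution with the given dilation."""
    return k + (k - 1) * (dilation - 1)

# A 3x3 conv with dilation 2 covers the same extent as a 5x5 conv,
# at the cost of a plain 3x3 conv and with stride 1 (no downsampling).
print(effective_kernel(3, 1))  # 3
print(effective_kernel(3, 2))  # 5
```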

DetNet Design

The main architecture of DetNet is designed based on ResNet-50. The first four stages are kept the same as in ResNet-50. The main differences are as follows:

  1. The extra stages are merged into the backbone and will later be utilized for object detection as in FPN. Meanwhile, the spatial resolution is fixed at 16x downsampling even after stage 4.
  2. Since the spatial size is fixed after stage 4, in order to introduce a new stage, a dilated bottleneck with a 1×1 convolution projection is utilized at the beginning of each stage. The dilated convolution efficiently enlarges the receptive field.
  3. Since dilated convolution is still time-consuming, stage 5 and stage 6 keep the same number of channels as stage 4 (256 input channels for the bottleneck block). This differs from the traditional backbone design, which doubles the channels in each later stage.
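The three points above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the channel width (256) and the 16x-stride-preserving dilated 3×3 conv follow the paper's description, while the BatchNorm/ReLU placement and other hyperparameters are our assumptions.

```python
import torch
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """Sketch of a DetNet stage-5/6 bottleneck: stride 1 throughout,
    so the 16x downsampling ratio is preserved."""

    def __init__(self, channels=256, project_shortcut=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            # 3x3 conv with dilation=2: receptive field grows without downsampling.
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # 1x1 conv projection on the shortcut, used at the beginning of a stage.
        self.shortcut = (
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            if project_shortcut else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

x = torch.randn(1, 256, 32, 32)
block = DilatedBottleneck(project_shortcut=True)
print(block(x).shape)  # spatial size unchanged: torch.Size([1, 256, 32, 32])
```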

The following figure shows the dilated bottleneck with the 1×1 conv projection and the architecture of DetNet.

[Figure: dilated bottlenecks and the DetNet framework]

PERFORMANCE

[Figure: performance comparison]
