Reading notes--Deformable Convolution Networks

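This article introduces a new way to strengthen the transformation modeling capability of convolutional neural networks (CNNs): deformable convolution and deformable RoI pooling. Both add offsets to the sampling locations of a standard CNN to achieve free-form deformation, and both can be trained end-to-end by back-propagation without additional supervision.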

These notes record the parts of the original paper that I found most helpful for understanding it.

  In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision.
These new modules can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks.

Deformable Convolution

  Deformable convolution adds 2D offsets to the regular grid sampling locations in standard convolution. It enables free-form deformation of the sampling grid, as illustrated in Figure 1. The offsets are learned from the preceding feature maps, via additional convolution layers.

Figure 1

  The 2D convolution consists of two steps: 1) sampling using a regular grid R over the input feature map x; 2) summation of sampled values weighted by w. The grid R defines the receptive field size and dilation. For example,

$$R = \{(-1,-1),\ (-1,0),\ \ldots,\ (0,1),\ (1,1)\}$$

defines a 3 x 3 kernel with dilation 1.
  For each location $\mathbf{p}_0$ on the output feature map y, we have

$$y(\mathbf{p}_0) = \sum_{\mathbf{p}_n \in R} w(\mathbf{p}_n) \cdot x(\mathbf{p}_0 + \mathbf{p}_n), \tag{1}$$

  where $\mathbf{p}_n$ enumerates the locations in R.
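
Eq. (1) can be sketched in a few lines of pure Python. This is only an illustration under assumptions that are not in the paper notes: a single-channel feature map stored as a list of lists, locations written as (row, column) pairs, zero padding outside the map, and the hypothetical helper names `make_grid` and `conv_at`.

```python
def make_grid(kernel_size=3, dilation=1):
    """Regular sampling grid R as (d_row, d_col) pairs centered on zero."""
    half = kernel_size // 2
    return [(dr * dilation, dc * dilation)
            for dr in range(-half, half + 1)
            for dc in range(-half, half + 1)]


def conv_at(x, w, p0, grid):
    """Eq. (1): y(p0) = sum over p_n in R of w(p_n) * x(p0 + p_n).

    x is a 2D list and w maps each p_n to its weight. Locations that fall
    outside x contribute zero (an assumed zero-padding convention).
    """
    height, width = len(x), len(x[0])
    total = 0.0
    for pn in grid:
        row, col = p0[0] + pn[0], p0[1] + pn[1]
        if 0 <= row < height and 0 <= col < width:
            total += w[pn] * x[row][col]
    return total


R = make_grid(3, 1)                    # the 3 x 3 grid with dilation 1
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
w = {pn: 1.0 for pn in R}              # all-ones kernel, so y is a plain sum
y_center = conv_at(x, w, (1, 1), R)    # sums all nine values of x
```

Changing `dilation` only rescales the entries of R; the weighted sum itself is unchanged.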
  In deformable convolution, the regular grid R is augmented with offsets $\{\Delta\mathbf{p}_n \mid n = 1, \ldots, N\}$, where $N = |R|$. Eq. (1) becomes

$$y(\mathbf{p}_0) = \sum_{\mathbf{p}_n \in R} w(\mathbf{p}_n) \cdot x(\mathbf{p}_0 + \mathbf{p}_n + \Delta\mathbf{p}_n). \tag{2}$$

  Now, the sampling is on the irregular and offset locations $\mathbf{p}_n + \Delta\mathbf{p}_n$. As the offset $\Delta\mathbf{p}_n$ is typically fractional, Eq. (2) is implemented via bilinear interpolation as

$$x(\mathbf{p}) = \sum_{\mathbf{q}} G(\mathbf{q}, \mathbf{p}) \cdot x(\mathbf{q}), \tag{3}$$

where $\mathbf{p}$ denotes an arbitrary (fractional) location ($\mathbf{p} = \mathbf{p}_0 + \mathbf{p}_n + \Delta\mathbf{p}_n$ for Eq. (2)), $\mathbf{q}$ enumerates all integral spatial locations in the feature map x, and $G(\cdot, \cdot)$ is the bilinear interpolation kernel. Note that G is two-dimensional. It is separated into two one-dimensional kernels as

$$G(\mathbf{q}, \mathbf{p}) = g(q_x, p_x) \cdot g(q_y, p_y), \tag{4}$$

where $g(a, b) = \max(0, 1 - |a - b|)$. Eq. (3) is fast to compute, as $G(\mathbf{q}, \mathbf{p})$ is non-zero only for a few $\mathbf{q}$s.
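
As a rough illustration of Eqs. (2) to (4), the sketch below samples a single-channel feature map at fractional locations and accumulates the weighted sum. It makes several assumptions: locations are (row, column) pairs, values outside the map are treated as zero, and the offsets are passed in as fixed numbers rather than predicted by an extra convolution layer as in the paper. The names `bilinear_sample` and `deform_conv_at` are hypothetical.

```python
import math


def g(a, b):
    """1D bilinear kernel from Eq. (4): max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))


def bilinear_sample(x, p):
    """Eq. (3): x(p) = sum_q G(q, p) * x(q) for a fractional p = (row, col).

    G is non-zero for at most the four integer neighbours of p, so only
    those are visited. Neighbours outside x are treated as zero.
    """
    height, width = len(x), len(x[0])
    r0, c0 = math.floor(p[0]), math.floor(p[1])
    value = 0.0
    for qr in (r0, r0 + 1):
        for qc in (c0, c0 + 1):
            if 0 <= qr < height and 0 <= qc < width:
                value += g(qr, p[0]) * g(qc, p[1]) * x[qr][qc]
    return value


def deform_conv_at(x, w, p0, grid, offsets):
    """Eq. (2): y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n)."""
    total = 0.0
    for pn, dpn in zip(grid, offsets):
        p = (p0[0] + pn[0] + dpn[0], p0[1] + pn[1] + dpn[1])
        total += w[pn] * bilinear_sample(x, p)
    return total


grid = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
w = {pn: 1.0 for pn in grid}
zero = [(0.0, 0.0)] * len(grid)        # no deformation: reduces to Eq. (1)
shifted = [(0.0, 0.5)] * len(grid)     # every sample moved half a pixel right
y_regular = deform_conv_at(x, w, (1, 1), grid, zero)
y_shifted = deform_conv_at(x, w, (1, 1), grid, shifted)
```

With all offsets set to zero the result matches ordinary convolution, which is a handy sanity check for any implementation of Eq. (2).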

Deformable RoI pooling

  Deformable RoI pooling adds an offset to each bin position in the regular bin partition of the previous RoI pooling. Similarly, the offsets are learned from the preceding feature maps and RoIs, enabling adaptive part localization for objects with different shapes.

  RoI pooling is used in all region-proposal-based object detection methods. It converts an input rectangular region of arbitrary size into fixed-size features.
  RoI Pooling: Given the input feature map x and an RoI of size w × h with top-left corner $\mathbf{p}_0$, RoI pooling divides the RoI into k × k bins (k is a free parameter) and outputs a k × k feature map y. For the (i, j)-th bin ($0 \le i, j < k$), we have

$$y(i, j) = \sum_{\mathbf{p} \in bin(i, j)} x(\mathbf{p}_0 + \mathbf{p}) / n_{ij}, \tag{5}$$

where $n_{ij}$ is the number of pixels in the bin. The (i, j)-th bin spans $\lfloor i \frac{w}{k} \rfloor \le p_x < \lceil (i+1) \frac{w}{k} \rceil$ and $\lfloor j \frac{h}{k} \rfloor \le p_y < \lceil (j+1) \frac{h}{k} \rceil$.
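
The bin spans and the averaging in Eq. (5) can be sketched as follows. This is a minimal example under stated assumptions: a single-channel feature map indexed as `x[row][col]`, an RoI with integer size and corner given as `(x0, y0)`, and a result stored in a dict keyed by `(i, j)`. The helper names `bin_span` and `roi_pool` are illustrative only.

```python
import math


def bin_span(i, j, w, h, k):
    """Half-open pixel ranges of bin (i, j), relative to the RoI corner.

    Returns (x_start, x_end, y_start, y_end) with
    floor(i*w/k) <= p_x < ceil((i+1)*w/k) and likewise for p_y with j and h.
    """
    return (math.floor(i * w / k), math.ceil((i + 1) * w / k),
            math.floor(j * h / k), math.ceil((j + 1) * h / k))


def roi_pool(x, p0, w, h, k):
    """Eq. (5): average x over each bin of a w x h RoI with corner p0 = (x0, y0)."""
    x0, y0 = p0
    y = {}
    for j in range(k):
        for i in range(k):
            xs, xe, ys, ye = bin_span(i, j, w, h, k)
            values = [x[y0 + py][x0 + px]
                      for py in range(ys, ye) for px in range(xs, xe)]
            y[(i, j)] = sum(values) / len(values)   # len(values) is n_ij
    return y


x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
y = roi_pool(x, (0, 0), 4, 4, 2)       # 4 x 4 RoI pooled into 2 x 2 bins
```

Because of the floor and ceiling, neighbouring bins share a row or column of pixels whenever w or h is not divisible by k; for example, `bin_span(1, 0, 5, 5, 2)` starts at column 2 while bin (0, 0) ends just before column 3.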
Similarly to Eq. (2), in deformable RoI pooling, offsets $\{\Delta\mathbf{p}_{ij} \mid 0 \le i, j < k\}$ are added to the spatial binning positions. Eq. (5) becomes

$$y(i, j) = \sum_{\mathbf{p} \in bin(i, j)} x(\mathbf{p}_0 + \mathbf{p} + \Delta\mathbf{p}_{ij}) / n_{ij}. \tag{6}$$

Typically, $\Delta\mathbf{p}_{ij}$ is fractional. Eq. (6) is implemented by bilinear interpolation via Eq. (3) and Eq. (4).
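
A small sketch of Eq. (6) follows, again under assumptions: a single-channel map indexed as `x[row][col]`, zero values outside the map, an RoI corner given as `(x0, y0)`, and per-bin offsets supplied by hand as `(dx, dy)` pairs in a dict. In the paper the offsets are learned from the preceding feature maps and RoIs; that part is not modelled here, and the function names are hypothetical.

```python
import math


def bilinear_sample(x, px, py):
    """Bilinear interpolation of x[row][col] at fractional (px, py); zero outside x."""
    height, width = len(x), len(x[0])
    c0, r0 = math.floor(px), math.floor(py)
    value = 0.0
    for r in (r0, r0 + 1):
        for c in (c0, c0 + 1):
            if 0 <= r < height and 0 <= c < width:
                weight = max(0.0, 1 - abs(c - px)) * max(0.0, 1 - abs(r - py))
                value += weight * x[r][c]
    return value


def deform_roi_pool(x, p0, w, h, k, offsets):
    """Eq. (6): RoI pooling where bin (i, j) is shifted by offsets[(i, j)] = (dx, dy)."""
    x0, y0 = p0
    y = {}
    for j in range(k):
        for i in range(k):
            dx, dy = offsets[(i, j)]
            xs, xe = math.floor(i * w / k), math.ceil((i + 1) * w / k)
            ys, ye = math.floor(j * h / k), math.ceil((j + 1) * h / k)
            samples = [bilinear_sample(x, x0 + px + dx, y0 + py + dy)
                       for py in range(ys, ye) for px in range(xs, xe)]
            y[(i, j)] = sum(samples) / len(samples)   # divide by n_ij
    return y


x = [[float(6 * r + c) for c in range(6)] for r in range(6)]
offsets = {(i, j): (0.0, 0.0) for i in range(2) for j in range(2)}
y_plain = deform_roi_pool(x, (1, 1), 4, 4, 2, offsets)   # same as Eq. (5)
offsets[(0, 0)] = (0.5, 0.0)           # move only the top-left bin half a pixel right
y_deformed = deform_roi_pool(x, (1, 1), 4, 4, 2, offsets)
```

Only the bin whose offset changed produces a different output, which reflects that each bin is displaced independently.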

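### Deformable attention in RT-DETR

To broaden the model's receptive field over features at different scales and make it more flexible, RT-DETR adopts an improved version of the deformable attention mechanism from Deformable DETR. This mechanism lets each query position adaptively focus on different spatial regions of the input image, so that it can better capture the key parts of target objects.

Specifically, offset parameters are added on top of the standard multi-head attention layer to adjust the positions of the sampling points. These offsets are learned and can vary dynamically with the needs of the task, which allows objects to be located and recognized more accurately even in complex scenes[^2].

#### Implementation details

The steps for integrating deformable attention into the RT-DETR framework are:

1. **Define an offset generation module**: create a new sub-network that computes the relative displacement vectors for each attention head.
2. **Modify the original attention layer**: replace the traditional fixed grid with new coordinates guided by the output of that sub-network.
3. **Update the weights during training**: make sure gradients flow back to the part that produces the offsets so that it can be optimized.

#### Example code

The following simplified Python pseudocode shows how a Transformer decoder unit with a deformable attention component could be built:

```python
import torch.nn as nn

# Pseudocode: MultiHeadedDeformableAttention, TransformerDecoderLayerWithDeformableAttention
# and TransformerDecoder are placeholders that are not defined here, and the
# "..." arguments are left unspecified.


class DeformableAttnLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        # Predict a bounded 2D offset for each sampling point from the query.
        self.offset_net = nn.Sequential(
            nn.Linear(d_model, 49 * 2),  # assumes 7x7 sampling points, each with an offset in two directions
            nn.Tanh()
        )
        self.deform_attn = MultiHeadedDeformableAttention(nhead=nhead)

    def forward(self, query, key, value):
        # Reshape the predicted offsets to (N, 49, 2) before passing them on.
        offsets = self.offset_net(query).view(-1, 49, 2)
        output = self.deform_attn(query=query, key=key, value=value, offset=offsets)
        return output


def build_transformer_decoder_with_deformable_attention():
    decoder_layer = TransformerDecoderLayerWithDeformableAttention(...)
    transformer_decoder = TransformerDecoder(decoder_layer, num_layers=...)
    return transformer_decoder
```

This code shows, in a simple example, how deformable attention can be added to an existing Transformer architecture. In practice, further factors such as performance tuning also need to be considered.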