Reading notes--Deformable Convolutional Networks

wsq1920

License: CC 4.0 BY-SA

Category: deep learning. Tags: Semantic Segmentation

Article link: https://blog.youkuaiyun.com/m0_37718446/article/details/80647589

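This article introduces a new approach to strengthening the transformation modeling capability of convolutional neural networks (CNNs): deformable convolution and deformable RoI pooling. Both add offsets to the standard CNN operations to enable free-form deformation, and both can be trained end-to-end by back-propagation without additional supervision.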

These notes record the parts of the original paper that I found most helpful for understanding it.

In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision.
The new modules can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks.

Deformable Convolution

Deformable convolution adds 2D offsets to the regular grid sampling locations in standard convolution. It enables free-form deformation of the sampling grid, as illustrated in Figure 1. The offsets are learned from the preceding feature maps via additional convolutional layers.

The 2D convolution consists of two steps: 1) sampling using a regular grid R over the input feature map x; 2) summation of sampled values weighted by w. The grid R defines the receptive field size and dilation. For example,

$R = \{(-1,-1),(-1,0),\ldots,(0,1),(1,1)\}$

defines a 3 x 3 kernel with dilation 1.
For each location $p_0$ on the output feature map y, we have

$\begin{equation} y(p_0) = \sum_{p_n\in R} w(p_n)\cdot x(p_0+p_n), \tag{1} \end{equation}$

where $p_n$ enumerates the locations in R.
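
To make Eq.(1) concrete, here is a minimal pure-Python sketch of the two steps (sampling on the grid R, then a weighted sum) for a single output location. The helper names `regular_grid` and `conv_at`, the (row, column) ordering of coordinates, and the zero padding at the border are assumptions made for illustration; they are not taken from the paper.

```python
def regular_grid(kernel_size=3, dilation=1):
    """Return the sampling grid R as a list of (dy, dx) offsets centred on 0."""
    half = kernel_size // 2
    steps = [dilation * i for i in range(-half, half + 1)]
    return [(dy, dx) for dy in steps for dx in steps]


def conv_at(x, w, p0, grid):
    """Eq. (1): y(p0) = sum over p_n in R of w(p_n) * x(p0 + p_n).

    x is a 2D list (feature map) and w maps each grid offset to a weight.
    Locations outside x are treated as zero (zero padding, an assumption).
    """
    height, width = len(x), len(x[0])
    total = 0.0
    for (dy, dx) in grid:
        row, col = p0[0] + dy, p0[1] + dx
        if 0 <= row < height and 0 <= col < width:
            total += w[(dy, dx)] * x[row][col]
    return total


R = regular_grid(3, 1)           # 3 x 3 kernel with dilation 1
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
w = {p_n: 1.0 for p_n in R}      # all-ones kernel, chosen only for illustration
y_center = conv_at(x, w, (1, 1), R)   # sums all nine input values
```

Changing `dilation` only rescales the offsets in R, which is why the grid alone determines both the receptive field size and the dilation.
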
In deformable convolution, the regular grid R is augmented with offsets $\{\Delta p_n|n = 1,...,N\}$, where N = |R|. Eq.(1) becomes

$\begin{equation} y(p_0) = \sum_{p_n\in R} w(p_n)\cdot x(p_0+p_n+\Delta p_n). \tag{2} \end{equation}$

Now, the sampling is on the irregular and offset locations $p_n+\Delta p_n$. As the offset $\Delta p_n$ is typically fractional, Eq.(2) is implemented via bilinear interpolation as

$\begin{equation} x(p) = \sum_q G(q,p)\cdot x(q), \tag{3} \end{equation}$
where p denotes an arbitrary (fractional) location ($p = p_0+p_n+\Delta p_n$ for Eq.(2)), q enumerates all integral spatial locations in the feature map x, and $G(\cdot,\cdot)$ is the bilinear interpolation kernel. Note that G is two dimensional. It is separated into two one dimensional kernels as

$\begin{equation} G(q,p) = g(q_x,p_x)\cdot g(q_y,p_y), \tag{4} \end{equation}$
where $g(a,b) = \max(0,1 - |a - b|)$. Eq.(3) is fast to compute as $G(q,p)$ is non-zero only for a few $q$s.
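
Eqs.(2)-(4) can be put together in a short pure-Python sketch for one output location. It is only an illustration under stated assumptions: the function names `g`, `bilinear_sample` and `deform_conv_at` are hypothetical, coordinates are stored as (row, column), the offsets are passed in by hand rather than predicted by a convolutional layer, and positions outside the feature map contribute zero.

```python
import math


def g(a, b):
    """1D bilinear kernel from Eq. (4): g(a, b) = max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))


def bilinear_sample(x, p):
    """Eq. (3): x(p) = sum_q G(q, p) * x(q), with G(q, p) = g(q_y, p_y) * g(q_x, p_x).

    Only the (up to) four integer neighbours of p can have non-zero G, so
    only those are visited. Positions outside x contribute zero (assumption).
    """
    height, width = len(x), len(x[0])
    py, px = p
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < height and 0 <= qx < width:
                value += g(qy, py) * g(qx, px) * x[qy][qx]
    return value


def deform_conv_at(x, w, p0, grid, offsets):
    """Eq. (2): y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n)."""
    total = 0.0
    for p_n, d in zip(grid, offsets):
        p = (p0[0] + p_n[0] + d[0], p0[1] + p_n[1] + d[1])
        total += w[p_n] * bilinear_sample(x, p)
    return total


grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
w = {p_n: 1.0 for p_n in grid}              # all-ones kernel, for illustration
zero_offsets = [(0.0, 0.0)] * len(grid)     # reduces Eq. (2) to Eq. (1)
half_shift = [(0.0, 0.5)] * len(grid)       # every sample moved half a pixel right

y_regular = deform_conv_at(x, w, (1, 1), grid, zero_offsets)
y_shifted = deform_conv_at(x, w, (1, 1), grid, half_shift)
```

With all offsets set to zero, every sampling location is integral, G picks out exactly one pixel per grid point, and the result matches the standard convolution of Eq.(1).
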

Deformable RoI pooling

Deformable RoI pooling adds an offset to each bin position in the regular bin partition of the previous RoI pooling. Similarly, the offsets are learned from the preceding feature maps and RoIs, enabling adaptive part localization for objects with different shapes.

RoI pooling is used in all region proposal based object detection methods. It converts an input rectangular region of arbitrary size into fixed-size features.
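
As a rough illustration of how a region of arbitrary size becomes a fixed-size feature, the sketch below average-pools a w x h region into k x k bins in pure Python. The function name `roi_pool`, the (row, column) storage order, and the floor/ceil bin boundaries applied identically along both axes are assumptions made for this example.

```python
import math


def roi_pool(x, top_left, w, h, k):
    """Average-pool a w x h RoI of feature map x into a k x k output.

    Bin i along the width covers floor(i*w/k) <= p_x < ceil((i+1)*w/k);
    the same rule is assumed along the height. The RoI must lie inside x.
    """
    y0, x0 = top_left                      # (row, column) of the RoI corner
    out = [[0.0] * k for _ in range(k)]
    for j in range(k):                     # bin index along the height
        y_start, y_end = math.floor(j * h / k), math.ceil((j + 1) * h / k)
        for i in range(k):                 # bin index along the width
            x_start, x_end = math.floor(i * w / k), math.ceil((i + 1) * w / k)
            values = [x[y0 + py][x0 + px]
                      for py in range(y_start, y_end)
                      for px in range(x_start, x_end)]
            out[j][i] = sum(values) / len(values)   # mean over the bin's pixels
    return out


feature = [[0, 1, 2, 3],
           [4, 5, 6, 7],
           [8, 9, 10, 11],
           [12, 13, 14, 15]]
pooled = roi_pool(feature, (0, 0), 4, 4, 2)   # 4 x 4 region -> 2 x 2 output
small = roi_pool(feature, (1, 1), 3, 2, 2)    # 3 x 2 region -> still 2 x 2
```

Whatever the region size, the output is always k x k, which is what lets the later layers of a detector work with a fixed input shape.
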
RoI Pooling. Given the input feature map x and an RoI of size w x h with top-left corner $p_0$, RoI pooling divides the RoI into k x k (k is a free parameter) bins and outputs a k x k feature map y. For the (i,j)-th bin $(0 \leq i,j < k)$, we have

$\begin{equation} y(i,j) = \sum_{p\in bin(i,j)}x(p_0+p)/n_{ij}, \tag{5} \end{equation}$
where $n_{ij}$ is the number of pixels in the bin. The (i,j)-th bin spans $\lfloor i \frac{w}{k} \rfloor \leq p_x < \lceil (i+1) \frac{w}{k} \rceil$ and $\lfloor j \frac{h}{k} \rfloor \leq p_y <$