Deformable DETR Paper Deep Dive: Deformable Transformers for End-to-End Object Detection

DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION

Xizhou Zhu¹∗, Weijie Su²∗‡, Lewei Lu¹, Bin Li², Xiaogang Wang¹,³, Jifeng Dai¹†

¹SenseTime Research ²University of Science and Technology of China ³The Chinese University of Hong Kong

{zhuwalter,luotto,daijifeng}@sensetime.com jackroos@mail.ustc.edu.cn, binli@ustc.edu.cn xgwang@ee.cuhk.edu.hk

ABSTRACT

DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.

1 INTRODUCTION

Modern object detectors employ many hand-crafted components (Liu et al., 2020), e.g., anchor generation, rule-based training target assignment, non-maximum suppression (NMS) post-processing. They are not fully end-to-end. Recently, Carion et al. (2020) proposed DETR to eliminate the need for such hand-crafted components, and built the first fully end-to-end object detector, achieving very competitive performance. DETR utilizes a simple architecture, by combining convolutional neural networks (CNNs) and Transformer (Vaswani et al., 2017) encoder-decoders. They exploit the versatile and powerful relation modeling capability of Transformers to replace the hand-crafted rules, under properly designed training signals.

Despite its interesting design and good performance, DETR has its own issues: (1) It requires many more training epochs to converge than existing object detectors. For example, on the COCO (Lin et al., 2014) benchmark, DETR needs 500 epochs to converge, which is around 10 to 20 times slower than Faster R-CNN (Ren et al., 2015). (2) DETR delivers relatively low performance at detecting small objects. Modern object detectors usually exploit multi-scale features, where small objects are detected from high-resolution feature maps. Meanwhile, high-resolution feature maps lead to unacceptable complexities for DETR. The above-mentioned issues can be mainly attributed to the deficit of Transformer components in processing image feature maps. At initialization, the attention modules cast nearly uniform attention weights to all the pixels in the feature maps. Long training schedules are necessary for the attention weights to learn to focus on sparse, meaningful locations. On the other hand, the attention weight computation in the Transformer encoder is quadratic w.r.t. the number of pixels, so processing high-resolution feature maps incurs very high computational and memory complexity.

In the image domain, deformable convolution (Dai et al., 2017) is a powerful and efficient mechanism for attending to sparse spatial locations, and it naturally avoids the above-mentioned issues. However, it lacks the element relation modeling mechanism that is key to the success of DETR.

∗Equal contribution. †Corresponding author. ‡Work is done during an internship at SenseTime Research.

In this paper, we propose Deformable DETR, which mitigates the slow convergence and high complexity issues of DETR. It combines the best of the sparse spatial sampling of deformable convolution, and the relation modeling capability of Transformers. We propose the deformable attention module, which attends to a small set of sampling locations as a pre-filter for prominent key elements out of all the feature map pixels. The module can be naturally extended to aggregating multi-scale features, without the help of FPN (Lin et al., 2017a). In Deformable DETR, we utilize (multi-scale) deformable attention modules to replace the Transformer attention modules processing feature maps, as shown in Fig. 1.

![Figure 1: Illustration of the proposed Deformable DETR object detector.](Figure 1: Illustration of the proposed Deformable DETR object detector.)

Deformable DETR opens up possibilities for us to exploit variants of end-to-end object detectors, thanks to its fast convergence, and computational and memory efficiency. We explore a simple and effective iterative bounding box refinement mechanism to improve the detection performance. We also try a two-stage Deformable DETR, where the region proposals are also generated by a variant of Deformable DETR, which are further fed into the decoder for iterative bounding box refinement.

Extensive experiments on the COCO (Lin et al., 2014) benchmark demonstrate the effectiveness of our approach. Compared with DETR, Deformable DETR achieves better performance (especially on small objects) with 10× fewer training epochs. The proposed two-stage variant of Deformable DETR further improves the performance. Code is released at https://github.com/fundamentalvision/Deformable-DETR.

2 RELATED WORK

2.1 Efficient Attention Mechanism

Transformers (Vaswani et al., 2017) involve both self-attention and cross-attention mechanisms. One of the most well-known concerns about Transformers is their high time and memory complexity with vast numbers of key elements, which hinders model scalability in many cases. Recently, many efforts have been made to address this problem (Tay et al., 2020b); they can be roughly divided into three categories in practice.

The first category is to use pre-defined sparse attention patterns on keys. The most straightforward paradigm is restricting the attention pattern to be fixed local windows. Most works (Liu et al., 2018a; Parmar et al., 2018; Child et al., 2019; Huang et al., 2019; Ho et al., 2019; Wang et al., 2020a; Hu et al., 2019; Ramachandran et al., 2019; Qiu et al., 2019; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020) follow this paradigm. Although restricting the attention pattern to a local neighborhood can decrease the complexity, it loses global information. To compensate, Child et al. (2019); Huang et al. (2019); Ho et al. (2019); Wang et al. (2020a) attend key elements at fixed intervals to significantly increase the receptive field on keys. Beltagy et al. (2020); Ainslie et al. (2020); Zaheer et al. (2020) allow a small number of special tokens having access to all key elements. Zaheer et al. (2020); Qiu et al. (2019) also add some pre-fixed sparse attention patterns to attend distant key elements directly.

The second category is to learn data-dependent sparse attention. Kitaev et al. (2020) proposes a locality sensitive hashing (LSH) based attention, which hashes both the query and key elements to different bins. A similar idea is proposed by Roy et al. (2020), where k-means finds out the most related keys. Tay et al. (2020a) learns block permutation for block-wise sparse attention.

The third category is to explore the low-rank property in self-attention. Wang et al. (2020b) reduces the number of key elements through a linear projection on the size dimension instead of the channel dimension. Katharopoulos et al. (2020); Choromanski et al. (2020) rewrite the calculation of selfattention through kernelization approximation.

In the image domain, the designs of efficient attention mechanism (e.g., Parmar et al. (2018); Child et al. (2019); Huang et al. (2019); Ho et al. (2019); Wang et al. (2020a); Hu et al. (2019); Ramachandran et al. (2019)) are still limited to the first category. Despite the theoretically reduced complexity, Ramachandran et al. (2019); Hu et al. (2019) admit such approaches are much slower in implementation than traditional convolution with the same FLOPs (at least 3× slower), due to the intrinsic limitation in memory access patterns.

On the other hand, as discussed in Zhu et al. (2019a), there are variants of convolution, such as deformable convolution (Dai et al., 2017; Zhu et al., 2019b) and dynamic convolution (Wu et al., 2019), that also can be viewed as self-attention mechanisms. Especially, deformable convolution operates much more effectively and efficiently on image recognition than Transformer self-attention. Meanwhile, it lacks the element relation modeling mechanism.

Our proposed deformable attention module is inspired by deformable convolution, and belongs to the second category. It only focuses on a small fixed set of sampling points predicted from the feature of query elements. Different from Ramachandran et al. (2019); Hu et al. (2019), deformable attention is just slightly slower than the traditional convolution under the same FLOPs.

2.2 Multi-scale Feature Representation for Object Detection

One of the main difficulties in object detection is to effectively represent objects at vastly different scales. Modern object detectors usually exploit multi-scale features to accommodate this. As one of the pioneering works, FPN (Lin et al., 2017a) proposes a top-down path to combine multi-scale features. PANet (Liu et al., 2018b) further adds a bottom-up path on the top of FPN. Kong et al. (2018) combines features from all scales by a global attention operation. Zhao et al. (2019) proposes a U-shape module to fuse multi-scale features. Recently, NAS-FPN (Ghiasi et al., 2019) and Auto-FPN (Xu et al., 2019) are proposed to automatically design cross-scale connections via neural architecture search. Tan et al. (2020) proposes the BiFPN, which is a repeated simplified version of PANet. Our proposed multi-scale deformable attention module can naturally aggregate multi-scale feature maps via attention mechanism, without the help of these feature pyramid networks.

3 REVISITING TRANSFORMERS AND DETR

3.1 Multi-Head Attention in Transformers

Transformers (Vaswani et al., 2017) are a network architecture based on attention mechanisms, originally proposed for machine translation. Given a query element (e.g., a target word in the output sentence) and a set of key elements (e.g., source words in the input sentence), the multi-head attention module adaptively aggregates the key contents according to attention weights that measure the compatibility of query-key pairs. To allow the model to focus on contents from different representation subspaces and different positions, the outputs of different attention heads are linearly aggregated with learnable weights. Let \(q \in \Omega_{q}\) index a query element with representation feature \(z_{q} \in \mathbb{R}^{C}\), and \(k \in \Omega_{k}\) index a key element with representation feature \(x_{k} \in \mathbb{R}^{C}\), where \(C\) is the feature dimension, and \(\Omega_{q}\) and \(\Omega_{k}\) specify the sets of query and key elements, respectively. Then the multi-head attention feature is calculated by

\(MultiHeadAttn\left(z_{q}, x\right)=\sum_{m=1}^{M} W_{m}\left[\sum_{k \in \Omega_{k}} A_{mqk} \cdot W_{m}' x_{k}\right], \quad (1)\)

where \(m\) indexes the attention head, and \(W_{m}' \in \mathbb{R}^{C_{v} \times C}\) and \(W_{m} \in \mathbb{R}^{C \times C_{v}}\) are learnable weights (\(C_{v}=C/M\) by default). The attention weights \(A_{mqk} \propto \exp\left\{\frac{z_{q}^{T} U_{m}^{T} V_{m} x_{k}}{\sqrt{C_{v}}}\right\}\) are normalized so that \(\sum_{k \in \Omega_{k}} A_{mqk}=1\), where \(U_{m}, V_{m} \in \mathbb{R}^{C_{v} \times C}\) are also learnable weights. To disambiguate different spatial positions, the representation features \(z_{q}\) and \(x_{k}\) are usually the concatenation or summation of element contents and positional embeddings.
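To make Eq. 1 concrete, the snippet below is a minimal PyTorch sketch of the aggregation it describes, written for readability rather than speed; the tensor names mirror the notation above (\(z_q\), \(x\), \(U_m\), \(V_m\), \(W_m\), \(W_m'\)), and the toy shapes are illustrative assumptions, not values from the paper.

```python
import torch

# Minimal sketch of Eq. 1: M-head attention of N_q queries over N_k keys.
def multi_head_attn(z_q, x, U, V, W_prime, W):
    # z_q: (N_q, C), x: (N_k, C); U, V, W_prime: (M, C_v, C); W: (M, C, C_v)
    M, C_v, C = U.shape
    q = torch.einsum('mvc,qc->mqv', U, z_q)            # U_m z_q
    k = torch.einsum('mvc,kc->mkv', V, x)              # V_m x_k
    logits = torch.einsum('mqv,mkv->mqk', q, k) / C_v ** 0.5
    A = logits.softmax(dim=-1)                         # A_mqk, sums to 1 over keys
    v = torch.einsum('mvc,kc->mkv', W_prime, x)        # W'_m x_k
    agg = torch.einsum('mqk,mkv->mqv', A, v)           # inner sum over keys
    return torch.einsum('mcv,mqv->qc', W, agg)         # outer sum over heads

# toy shapes (illustrative only)
C, M, N_q, N_k = 256, 8, 4, 100
C_v = C // M
z_q, x = torch.randn(N_q, C), torch.randn(N_k, C)
U, V, W_prime = (torch.randn(M, C_v, C) for _ in range(3))
W = torch.randn(M, C, C_v)
print(multi_head_attn(z_q, x, U, V, W_prime, W).shape)   # torch.Size([4, 256])
```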

There are two known issues with Transformers. One is that Transformers need long training schedules before convergence. Suppose the numbers of query and key elements are \(N_{q}\) and \(N_{k}\), respectively. Typically, with proper parameter initialization, \(U_{m} z_{q}\) and \(V_{m} x_{k}\) follow distributions with mean 0 and variance 1, which makes the attention weights \(A_{mqk} \approx \frac{1}{N_{k}}\) when \(N_{k}\) is large. This leads to ambiguous gradients for the input features, so long training schedules are required before the attention weights can focus on specific keys. In the image domain, where the key elements are usually image pixels, \(N_{k}\) can be very large and convergence is slow.

On the other hand, the computational and memory complexity for multi-head attention can be very high with numerous query and key elements. The computational complexity of Eq. 1 is of \(O(N_{q} C^{2}+N_{k} C^{2}+N_{q} N_{k} C)\). In the image domain, where the query and key elements are both of pixels, \(N_{q}=N_{k} \gg C\), the complexity is dominated by the third term, as \(O(N_{q} N_{k} C)\). Thus, the multi-head attention module suffers from a quadratic complexity growth with the feature map size.

3.2 DETR

DETR (Carion et al., 2020) is built upon the Transformer encoder-decoder architecture, combined with a set-based Hungarian loss that forces unique predictions for each ground-truth bounding box via bipartite matching. We briefly review the network architecture as follows.

Given the input feature maps \(x \in \mathbb{R}^{C ×H ×W}\) extracted by a CNN backbone (e.g., ResNet (He et al., 2016)), DETR exploits a standard Transformer encoder-decoder architecture to transform the input feature maps to be features of a set of object queries. A 3-layer feed-forward neural network (FFN) and a linear projection are added on top of the object query features (produced by the decoder) as the detection head. The FFN acts as the regression branch to predict the bounding box coordinates \(b \in[0,1]^{4}\), where \(b={b_{x}, b_{y}, b_{w}, b_{h}}\) encodes the normalized box center coordinates, box height and width (relative to the image size). The linear projection acts as the classification branch to produce the classification results.

For the Transformer encoder in DETR, both the query and key elements are pixels in the feature maps. The inputs are the ResNet feature maps (with encoded positional embeddings). Let \(H\) and \(W\) denote the feature map height and width, respectively. The computational complexity of self-attention is \(O(H^{2} W^{2} C)\), which grows quadratically with the spatial size.

For the Transformer decoder in DETR, the input includes both feature maps from the encoder, and N object queries represented by learnable positional embeddings (e.g., \(N=100\)). There are two types of attention modules in the decoder, namely, cross-attention and self-attention modules. In the cross-attention modules, object queries extract features from the feature maps. The query elements are of the object queries, and key elements are of the output feature maps from the encoder. In it, \(N_{q}=N\), \(N_{k}=H ×W\) and the complexity of the cross-attention is of \(O(H W C^{2}+N H W C)\). The complexity grows linearly with the spatial size of feature maps. In the self-attention modules, object queries interact with each other, so as to capture their relations. The query and key elements are both of the object queries. In it, \(N_{q}=N_{k}=N\), and the complexity of the self-attention module is of \(O(2 N C^{2}+N^{2} C)\). The complexity is acceptable with moderate number of object queries.

DETR is an attractive design for object detection, which removes the need for many hand-designed components. However, it also has its own issues. These issues can be mainly attributed to the deficits of Transformer attention in handling image feature maps as key elements: (1) DETR has relatively low performance in detecting small objects. Modern object detectors use high-resolution feature maps to better detect small objects. However, high-resolution feature maps would lead to an unacceptable complexity for the self-attention module in the Transformer encoder of DETR, which has a quadratic complexity with the spatial size of input feature maps. (2) Compared with modern object detectors, DETR requires many more training epochs to converge. This is mainly because the attention modules processing image features are difficult to train. For example, at initialization, the cross-attention modules are almost of average attention on the whole feature maps. While, at the end of the training, the attention maps are learned to be very sparse, focusing only on the object extremities. It seems that DETR requires a long training schedule to learn such significant changes in the attention maps.

4 METHOD

4.1.1 Deformable Attention Module (continued)

Inspired by deformable convolution (Dai et al., 2017; Zhu et al., 2019b), the deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps, as shown in Fig. 2. By assigning only a small fixed number of keys for each query, the issues of convergence and feature spatial resolution can be mitigated.

![Figure 2: Illustration of the proposed deformable attention module.](Figure 2: Illustration of the proposed deformable attention module.)

Given an input feature map \(x \in \mathbb{R}^{C ×H ×W}\), let \(q\) index a query element with content feature \(z_{q}\) and a 2-d reference point \(p_{q}\), the deformable attention feature is calculated by

\(DeformAttn\left(z_{q}, p_{q}, x\right)=\sum_{m=1}^{M} W_{m}\left[\sum_{k=1}^{K} A_{mqk} \cdot W_{m}' x\left(p_{q}+\Delta p_{mqk}\right)\right], \quad (2)\)

where \(m\) indexes the attention head, \(k\) indexes the sampled keys, and \(K\) is the total sampled key number (\(K \ll H W\)). \(\Delta p_{m q k}\) and \(A_{m q k}\) denote the sampling offset and attention weight of the \(k^{\text{th}}\) sampling point in the \(m^{\text{th}}\) attention head, respectively. The scalar attention weight \(A_{m q k}\) lies in the range [0, 1], normalized by \(\sum_{k=1}^{K} A_{m q k}=1\). \(\Delta p_{m q k} \in \mathbb{R}^{2}\) are 2-d real numbers with unconstrained range. As \(p_{q}+\Delta p_{m q k}\) is fractional, bilinear interpolation is applied as in Dai et al. (2017) in computing \(x(p_{q}+\Delta p_{m q k})\). Both \(\Delta p_{m q k}\) and \(A_{m q k}\) are obtained via linear projection over the query feature \(z_{q}\). In implementation, the query feature \(z_{q}\) is fed to a linear projection operator of \(3MK\) channels, where the first \(2MK\) channels encode the sampling offsets \(\Delta p_{m q k}\), and the remaining \(MK\) channels are fed to a softmax operator to obtain the attention weights \(A_{m q k}\).
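The following is a simplified, single-image PyTorch sketch of the deformable attention module just described. It is a didactic re-implementation, not the released code (which uses a custom CUDA sampling kernel): bilinear interpolation is done with `F.grid_sample`, and reference points and offsets are handled in normalized coordinates for convenience. The 2MK-offset / MK-weight split of the query projection follows the description above; the shapes in the usage example are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAttnSketch(nn.Module):
    """Simplified single-scale deformable attention (Eq. 2) for one image."""
    def __init__(self, C=256, M=8, K=4):
        super().__init__()
        self.M, self.K, self.C_v = M, K, C // M
        self.sampling_offsets = nn.Linear(C, 2 * M * K)    # predicts Δp_mqk
        self.attention_weights = nn.Linear(C, M * K)       # predicts A_mqk (pre-softmax)
        self.value_proj = nn.Linear(C, C)                  # W'_m for all heads at once
        self.output_proj = nn.Linear(C, C)                 # W_m for all heads at once

    def forward(self, z_q, ref_points, x):
        # z_q: (N_q, C) query features; ref_points: (N_q, 2) in [0, 1] as (x, y);
        # x: (C, H, W) input feature map
        N_q, _ = z_q.shape
        H, W = x.shape[-2:]
        offsets = self.sampling_offsets(z_q).view(N_q, self.M, self.K, 2)
        weights = self.attention_weights(z_q).view(N_q, self.M, self.K).softmax(-1)
        value = self.value_proj(x.permute(1, 2, 0))            # (H, W, C)
        value = value.permute(2, 0, 1).reshape(self.M, self.C_v, H, W)
        # sampling locations: reference point plus (normalized) offsets
        norm = torch.tensor([W, H], dtype=x.dtype)
        loc = ref_points[:, None, None, :] + offsets / norm    # (N_q, M, K, 2)
        grid = (2 * loc - 1).permute(1, 0, 2, 3)               # (M, N_q, K, 2), in [-1, 1]
        sampled = F.grid_sample(value, grid, mode="bilinear", align_corners=False)
        out = (sampled * weights.permute(1, 0, 2)[:, None]).sum(-1)   # (M, C_v, N_q)
        return self.output_proj(out.permute(2, 0, 1).reshape(N_q, -1))

attn = DeformAttnSketch()
feat = torch.randn(256, 32, 32)
queries, refs = torch.randn(10, 256), torch.rand(10, 2)
print(attn(queries, refs, feat).shape)   # torch.Size([10, 256])
```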

The deformable attention module is designed for processing convolutional feature maps as key elements. Let \(N_{q}\) be the number of query elements; when \(MK\) is relatively small, the complexity of the deformable attention module is \(O(2 N_{q} C^{2}+\min (H W C^{2}, N_{q} K C^{2}))\) (See Appendix A.1 for details). When applied in the DETR encoder (where \(N_{q}=H W\)), the complexity becomes \(O(H W C^{2})\), which is linear with the spatial size. When applied as the cross-attention modules in the DETR decoder (where \(N_{q}=N\), \(N\) is the number of object queries), the complexity becomes \(O(N K C^{2})\), which is irrelevant to the spatial size \(H W\).

4.1.2 Multi-scale Deformable Attention Module

Most modern object detection frameworks benefit from multi-scale feature maps (Liu et al., 2020). Our proposed deformable attention module can be naturally extended for multi-scale feature maps.

Let \(\{x^{l}\}_{l=1}^{L}\) be the input multi-scale feature maps, where \(x^{l} \in \mathbb{R}^{C ×H_{l} ×W_{l}}\). Let \(\hat{p}_{q} \in[0,1]^{2}\) be the normalized coordinates of the reference point for each query element \(q\); then the multi-scale deformable attention module is applied as

\(MSDeformAttn\left(z_{q}, \hat{p}_{q},\left\{x^{l}\right\}_{l=1}^{L}\right)=\sum_{m=1}^{M} W_{m}\left[\sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W_{m}' x^{l}\left(\phi_{l}\left(\hat{p}_{q}\right)+\Delta p_{mlqk}\right)\right], \quad (3)\)

where \(m\) indexes the attention head, \(l\) indexes the input feature level, and \(k\) indexes the sampling point. \(\Delta p_{m l q k}\) and \(A_{m l q k}\) denote the sampling offset and attention weight of the \(k^{\text{th}}\) sampling point in the \(l^{\text{th}}\) feature level and the \(m^{\text{th}}\) attention head, respectively. The scalar attention weight \(A_{m l q k}\) is normalized by \(\sum_{l=1}^{L} \sum_{k=1}^{K} A_{m l q k}=1\). Here, normalized coordinates \(\hat{p}_{q} \in[0,1]^{2}\) are used for clarity of scale formulation: (0, 0) and (1, 1) indicate the top-left and bottom-right image corners, respectively. Function \(\phi_{l}(\hat{p}_{q})\) in Equation 3 re-scales the normalized coordinates \(\hat{p}_{q}\) to the input feature map of the \(l\)-th level. The multi-scale deformable attention is very similar to the single-scale version, except that it samples \(LK\) points from multi-scale feature maps instead of \(K\) points from a single-scale feature map.
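A small sketch of the two multi-scale-specific ingredients of Eq. 3: the re-scaling function \(\phi_l\) and the joint normalization of the attention weights over all \(L \times K\) sampling points. The level resolutions are illustrative assumptions, and the query dimension is omitted for brevity.

```python
import torch

# φ_l(p̂_q): map a normalized reference point in [0, 1]^2 onto each feature level.
# Level resolutions are illustrative (roughly strides 8/16/32/64 of an 800x800 image).
level_sizes = [(100, 100), (50, 50), (25, 25), (13, 13)]   # (H_l, W_l), l = 1..L
p_hat = torch.tensor([0.3, 0.6])                           # (x, y) in [0, 1]
for l, (H_l, W_l) in enumerate(level_sizes, start=1):
    p_l = p_hat * torch.tensor([W_l, H_l], dtype=torch.float32)   # φ_l(p̂_q)
    print(f"level {l}: reference point at {p_l.tolist()}")

# In Eq. 3 the weights A_mlqk of one query are normalized jointly over the
# L*K sampling points of each head, i.e. a single softmax over an L*K vector.
M, L, K = 8, 4, 4
raw = torch.randn(M, L * K)               # raw outputs of the linear projection
A = raw.softmax(dim=-1).view(M, L, K)     # sums to 1 over l and k together
print(A.sum(dim=(1, 2)))                  # all ones
```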

The proposed attention module degenerates to deformable convolution (Dai et al., 2017) when \(L=1\), \(K=1\), and \(W_{m}' \in \mathbb{R}^{C_{v} ×C}\) is fixed as an identity matrix. Deformable convolution is designed for single-scale inputs, focusing only on one sampling point per attention head. However, our multi-scale deformable attention looks at multiple sampling points from multi-scale inputs. The proposed (multi-scale) deformable attention module can also be perceived as an efficient variant of Transformer attention, where a pre-filtering mechanism is introduced via deformable sampling locations. When the sampling points traverse all possible locations, the proposed attention module is equivalent to Transformer attention.

4.1.3 Deformable Transformer Encoder

We replace the Transformer attention modules processing feature maps in DETR with the proposed multi-scale deformable attention module. Both the input and output of the encoder are multi-scale feature maps with the same resolutions. In the encoder, we extract multi-scale feature maps \(\{x^{l}\}_{l=1}^{L-1}\) (\(L=4\)) from the output feature maps of stages \(C_{3}\) through \(C_{5}\) in ResNet (He et al., 2016) (transformed by a 1×1 convolution), where \(C_{l}\) has a resolution \(2^{l}\) lower than the input image. The lowest-resolution feature map \(x^{L}\) is obtained via a 3×3 stride-2 convolution on the final \(C_{5}\) stage (denoted as \(C_{6}\)). All multi-scale feature maps have \(C=256\) channels. Note that the top-down structure in FPN (Lin et al., 2017a) is not used, because our multi-scale deformable attention itself can exchange information among multi-scale feature maps. The construction of multi-scale feature maps is also illustrated in Appendix A.2. Experiments in Section 5.2 show that adding FPN does not improve performance.

When applying the multi-scale deformable attention module in the encoder, the output is multi-scale feature maps with the same resolutions as the input. Both key and query elements are pixels from the multi-scale feature maps. For each query pixel, the reference point is itself. To identify which feature level each query pixel belongs to, we add a scale-level embedding (denoted as \(e_{l}\)) to the feature representation, in addition to the positional embedding. Unlike positional embeddings with fixed encodings, the scale-level embeddings \(\{e_{l}\}_{l=1}^{L}\) are randomly initialized and jointly trained with the network.
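As a sketch of the encoder inputs described above, the snippet below projects dummy \(C_3\)-\(C_5\) backbone outputs to 256 channels with 1×1 convolutions, adds the extra stride-2 level, and attaches randomly initialized scale-level embeddings. The tensors stand in for real ResNet features, and the variable names are illustrative, not taken from the released code.

```python
import torch
import torch.nn as nn

# Dummy backbone outputs for a 512x512 image (strides 8, 16, 32).
C = 256
c3 = torch.randn(1, 512, 64, 64)
c4 = torch.randn(1, 1024, 32, 32)
c5 = torch.randn(1, 2048, 16, 16)

lateral = nn.ModuleList([nn.Conv2d(ch, C, kernel_size=1) for ch in (512, 1024, 2048)])
extra = nn.Conv2d(2048, C, kernel_size=3, stride=2, padding=1)   # produces x^4 (stride 64)
level_embed = nn.Parameter(torch.randn(4, C))                    # {e_l}, jointly trained

srcs = [lateral[0](c3), lateral[1](c4), lateral[2](c5), extra(c5)]
# flatten each level to (HW, C) tokens and add its scale-level embedding
tokens = [s.flatten(2).transpose(1, 2) + level_embed[l] for l, s in enumerate(srcs)]
print([s.shape for s in srcs], [t.shape for t in tokens])
```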

4.1.4 Deformable Transformer Decoder

The decoder contains cross-attention and self-attention modules. The query elements for both types of attention modules are object queries. In cross-attention modules, object queries extract features from the feature maps (key elements are the encoder’s output feature maps). In self-attention modules, object queries interact with each other (key elements are object queries). Since our deformable attention module is designed for processing convolutional feature maps as key elements, we only replace each cross-attention module with the multi-scale deformable attention module, while leaving self-attention modules unchanged. For each object query, the 2-d normalized coordinate of the reference point \(\hat{p}_{q}\) is predicted from its object query embedding via a learnable linear projection followed by a sigmoid function.

Because the multi-scale deformable attention module extracts image features around the reference point, we design the detection head to predict the bounding box as relative offsets to the reference point, further reducing optimization difficulty. The reference point serves as the initial guess of the box center, and the detection head predicts relative offsets to this reference point. See Appendix A.3 for details. This design ensures that the learned decoder attention has a strong correlation with the predicted bounding boxes, which also accelerates training convergence.

By replacing Transformer attention modules with deformable attention modules in DETR, we establish an efficient and fast-converging detection system, dubbed Deformable DETR (see Fig. 1).

4.2 ADDITIONAL IMPROVEMENTS AND VARIANTS FOR DEFORMABLE DETR

Deformable DETR enables exploration of various end-to-end object detector variants, thanks to its fast convergence and computational/memory efficiency. Due to space constraints, we only introduce the core ideas of these improvements and variants here; implementation details are provided in Appendix A.4.

4.2.1 Iterative Bounding Box Refinement

Inspired by iterative refinement in optical flow estimation (Teed & Deng, 2020), we design a simple and effective iterative bounding box refinement mechanism to improve detection performance. Each decoder layer refines bounding boxes based on predictions from the previous layer.

4.2.2 Two-Stage Deformable DETR

In the original DETR, decoder object queries are irrelevant to the current image. Inspired by two-stage object detectors, we explore a Deformable DETR variant that generates region proposals in the first stage. These proposals are then fed into the decoder for further refinement, forming a two-stage Deformable DETR.

In the first stage, to achieve high-recall proposals, each pixel in the multi-scale feature maps could serve as an object query. However, directly using pixels as object queries would lead to unacceptable computational and memory costs for the decoder’s self-attention modules (whose complexity grows quadratically with the number of queries). To avoid this, we remove the decoder and build an encoder-only Deformable DETR for region proposal generation: each pixel is assigned as an object query to directly predict a bounding box. Top-scoring bounding boxes are selected as region proposals, and no NMS is applied before feeding them to the second stage.
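A minimal sketch of this first-stage proposal selection: every (flattened) multi-scale pixel contributes one scored box, and the top-scoring boxes are kept with no NMS. The pixel count and the number of proposals are illustrative assumptions.

```python
import torch

# First-stage proposal selection: each flattened multi-scale pixel predicts one
# box and an objectness score; keep the top-scoring boxes, no NMS.
num_pixels, num_proposals = 20000, 300     # illustrative sizes
scores = torch.randn(num_pixels)           # foreground classification logits
boxes = torch.rand(num_pixels, 4)          # normalized (cx, cy, w, h) per pixel

topk_scores, topk_idx = scores.topk(num_proposals)
proposals = boxes[topk_idx]                # passed to the decoder as initial boxes
print(proposals.shape)                     # torch.Size([300, 4])
```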

5 EXPERIMENTS

5.1 Dataset

We conduct experiments on the COCO 2017 dataset (Lin et al., 2014). Models are trained on the train set and evaluated on the val set and test-dev set.

5.2 Implementation Details

We use an ImageNet (Deng et al., 2009) pre-trained ResNet-50 (He et al., 2016) as the backbone for ablation studies. Multi-scale feature maps are extracted without FPN (Lin et al., 2017a). By default, we set \(M=8\) (number of attention heads) and \(K=4\) (number of sampling points per attention head per feature level). Parameters of the deformable Transformer encoder are shared across different feature levels. Other hyperparameters and training strategies mostly follow DETR (Carion et al., 2020), with two exceptions:

  1. Focal Loss (Lin et al., 2017b) with a loss weight of 2 is used for bounding box classification.
  2. The number of object queries is increased from 100 to 300.

We also report the performance of DETR-DC5 with these modifications (denoted as DETR-DC5+) for fair comparison. By default, models are trained for 50 epochs, and the learning rate is decayed by a factor of 0.1 at the 40th epoch. Following DETR (Carion et al., 2020), we use the Adam optimizer (Kingma & Ba, 2015) with a base learning rate of \(2 ×10^{-4}\), \(\beta_{1}=0.9\), \(\beta_{2}=0.999\), and weight decay of \(10^{-4}\). Learning rates of the linear projections (for predicting object query reference points and sampling offsets) are multiplied by 0.1. Runtime is evaluated on an NVIDIA Tesla V100 GPU.
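A sketch of this optimizer setup in PyTorch, with the 0.1× learning-rate multiplier applied via parameter groups; the tiny stand-in model and the name-based matching are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

# Adam with base lr 2e-4 and weight decay 1e-4; the projections predicting
# reference points and sampling offsets get a 0.1x learning rate via parameter groups.
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "sampling_offsets": nn.Linear(8, 8),
    "reference_points": nn.Linear(8, 2),
})
base_lr, slow_keys = 2e-4, ("sampling_offsets", "reference_points")
param_groups = [
    {"params": [p for n, p in model.named_parameters()
                if not any(k in n for k in slow_keys)], "lr": base_lr},
    {"params": [p for n, p in model.named_parameters()
                if any(k in n for k in slow_keys)], "lr": base_lr * 0.1},
]
optimizer = torch.optim.Adam(param_groups, lr=base_lr,
                             betas=(0.9, 0.999), weight_decay=1e-4)
```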

5.3 Comparison with DETR

As shown in Table 1 and Fig. 3, compared with Faster R-CNN + FPN, DETR requires far more training epochs to converge and performs worse on small objects. In contrast, Deformable DETR outperforms DETR (especially on small objects) with 10× fewer training epochs. Iterative bounding box refinement and the two-stage paradigm further improve detection accuracy.

![Figure 3: Convergence curves of Deformable DETR and DETR-DC5 on the COCO 2017 val set. For Deformable DETR, we explore different training schedules by varying the epoch at which the learning rate is reduced (where the AP score jumps).](Figure 3: Convergence curves of Deformable DETR and DETR-DC5 on the COCO 2017 val set. For Deformable DETR, we explore different training schedules by varying the epoch at which the learning rate is reduced (where the AP score jumps).)

Deformable DETR has FLOPs on par with Faster R-CNN + FPN and DETR-DC5, but its runtime is 1.6× faster than DETR-DC5 and only 25% slower than Faster R-CNN + FPN. The slow speed of DETR-DC5 is mainly due to heavy memory access in Transformer attention; our deformable attention mitigates this issue (at the cost of unordered memory access), making it only slightly slower than traditional convolution.

5.4 Ablation Study on Deformable Attention

Table 2 presents ablation results for different design choices of the deformable attention module:

  • Using multi-scale inputs instead of single-scale inputs improves detection accuracy by 1.7% AP (especially 2.9% \(AP_{S}\) for small objects).
  • Increasing the number of sampling points \(K\) further improves AP by 0.9%.
  • Multi-scale deformable attention (enabling cross-scale information exchange) brings an additional 1.5% AP improvement.
  • Since cross-level feature exchange is already integrated, adding FPN does not improve performance.
  • When multi-scale attention is disabled and \(K=1\), the (multi-scale) deformable attention module degenerates to deformable convolution, leading to significantly lower accuracy.

Table 1: Comparison of Deformable DETR with DETR on the COCO 2017 val set. DETR-DC5+ denotes DETR-DC5 with Focal Loss and 300 object queries.

| Method | Epochs | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| Faster R-CNN + FPN | 12 | 42.0 | 62.1 | 45.4 | 23.7 | 45.4 | 56.2 |
| DETR-DC5 | 500 | 43.8 | 62.6 | 47.8 | 26.5 | 47.3 | 58.1 |
| DETR-DC5+ | 500 | 44.9 | 63.0 | 49.1 | 27.6 | 48.2 | 59.0 |
| Deformable DETR | 50 | 44.9 | 63.1 | 49.0 | 27.7 | 48.1 | 58.9 |
| + Iterative Refinement | 50 | 45.3 | 63.3 | 49.4 | 28.3 | 48.5 | 59.2 |
| + Two-Stage | 50 | 45.5 | 63.5 | 49.7 | 28.8 | 48.8 | 59.5 |

Table 2: Ablations for deformable attention on the COCO 2017 val set. “MS inputs” = multi-scale inputs; “MS attention” = multi-scale deformable attention; \(K\) = number of sampling points per attention head per feature level.

| MS inputs | MS attention | \(K\) | FPN | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|---|
| ✗ | ✗ | 1 | ✗ | 39.7 | 60.1 | 42.4 | 21.2 | 44.3 | 56.0 |
| ✗ | ✗ | 4 | ✗ | 41.4 | 60.9 | 44.9 | 24.1 | 44.6 | 56.1 |
| ✓ | ✗ | 4 | ✗ | 42.3 | 61.4 | 46.0 | 24.8 | 45.1 | 56.3 |
| ✓ | ✓ | 4 | ✗ | 43.8 | 62.6 | 47.7 | 26.4 | 47.1 | 58.0 |
| ✓ | ✓ | 4 | ✓ | 43.9 | 62.5 | 47.7 | 25.6 | 47.4 | 57.7 |
| ✓ | ✓ | 8 | ✗ | 44.7 | 63.0 | 48.8 | 27.2 | 47.9 | 58.7 |

5.5 Comparison with State-of-the-Art Methods

Table 3 compares Deformable DETR with other state-of-the-art object detectors on the COCO 2017 test-dev set. All models in Table 3 integrate both iterative bounding box refinement and the two-stage mechanism. Key results include:

  • With ResNet-101 and ResNeXt-101 (Xie et al., 2017) backbones, Deformable DETR achieves 48.7 AP and 49.0 AP, respectively, without additional "bells and whistles".
  • Using ResNeXt-101 with DCN (Zhu et al., 2019b) further boosts accuracy to 50.1 AP.
  • With test-time augmentations (TTA, including horizontal flip and multi-scale testing), the method reaches 52.3 AP, comparable to top-performing detectors like EfficientDet-D7 (Tan et al., 2020) while maintaining end-to-end simplicity.

Table 3: Comparison of Deformable DETR with state-of-the-art methods on the COCO 2017 test-dev set. “TTA” = test-time augmentations (horizontal flip + multi-scale testing).

| Method | Backbone | TTA | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| FCOS (Tian et al., 2019) | ResNeXt-101 | | 44.7 | 64.1 | 48.4 | 27.6 | 47.5 | 55.6 |
| ATSS (Zhang et al., 2020) | ResNeXt-101 + DCN | | 50.7 | 68.9 | 56.3 | 33.2 | 52.9 | 62.4 |
| TSD (Song et al., 2020) | SENet154 + DCN | | 51.2 | 71.9 | 56.0 | 33.8 | 54.8 | 64.2 |
| EfficientDet-D7 (Tan et al., 2020) | EfficientNet-B6 | | 52.2 | 71.4 | 56.3 | - | - | - |
| Deformable DETR | ResNet-50 | | 46.9 | 66.4 | 50.8 | 27.7 | 49.7 | 59.9 |
| Deformable DETR | ResNet-101 | | 48.7 | 68.1 | 52.9 | 29.1 | 51.5 | 62.0 |
| Deformable DETR | ResNeXt-101 | | 49.0 | 68.5 | 53.2 | 29.7 | 51.7 | 62.8 |
| Deformable DETR | ResNeXt-101 + DCN | | 50.1 | 69.7 | 54.6 | 30.6 | 52.8 | 64.7 |
| Deformable DETR | ResNeXt-101 + DCN | ✓ | 52.3 | 71.9 | 58.1 | 34.4 | 54.4 | 65.6 |

6 CONCLUSION

Deformable DETR is an efficient, fast-converging end-to-end object detector that enables exploration of more practical end-to-end detector variants. At its core are the (multi-scale) deformable attention modules—an efficient attention mechanism for processing image feature maps. By combining the sparse spatial sampling of deformable convolution with the relation modeling capability of Transformers, Deformable DETR addresses DETR’s key limitations (slow convergence and high complexity) while improving small-object detection performance. We hope our work opens new directions for end-to-end object detection research.

ACKNOWLEDGMENTS

This work is supported by the National Key R&D Program of China (2020AAA0105200), Beijing Academy of Artificial Intelligence, and the National Natural Science Foundation of China under Grants No. U19B2044 and No. 61836011.

REFERENCES

Joshua Ainslie, Santiago Ontanon, Chris Alberti, Philip Pham, Anirudh Ravula, and Sumit Sanghai. Etc: Encoding long and structured data in transformers. arXiv preprint arXiv:2004.08483, 2020.

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tamas Sarlos, David Belanger, Lucy Colwell, and Adrian Weller. Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555, 2020.

Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.

Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In ICCV, 2019.

Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franc¸ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. arXiv preprint arXiv:2006.16236, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020.

Tao Kong, Fuchun Sun, Chuanqi Tan, Huaping Liu, and Wenbing Huang. Deep feature pyramid reconfiguration for object detection. In ECCV, 2018.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

Tsung-Yi Lin, Piotr Doll´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017a.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In ICCV, 2017b.

Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietik¨ainen. Deep learning for generic object detection: A survey. IJCV, 2020.

Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In ICLR, 2018a.

Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018b.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, 2018.

Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972, 2019.

Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. arXiv preprint arXiv:2003.05997, 2020.

Guanglu Song, Yu Liu, and Xiaogang Wang. Revisiting the sibling head in object detector. In CVPR, 2020.

Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In CVPR, 2020.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. In ICML, 2020a.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020b.

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.

Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. arXiv preprint arXiv:2003.07853, 2020a.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020b.

Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In ICLR, 2019.

Saining Xie, Ross Girshick, Piotr Doll´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-fpn: Automatic network architecture adaptation for object detection beyond classification. In ICCV, 2019.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062, 2020.

Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, 2020.

Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2det: A single-shot object detector based on multi-level feature pyramid network. In AAAI, 2019.

Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and Jifeng Dai. An empirical study of spatial attention mechanisms in deep networks. In ICCV, 2019a.

Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In CVPR, 2019b.

APPENDIX

A.1 COMPLEXITY FOR DEFORMABLE ATTENTION

Let \(N_{q}\) be the number of query elements. In the deformable attention module (Equation 2), the complexity of calculating sampling offsets \(\Delta p_{m q k}\) and attention weights \(A_{m q k}\) is \(O(3 N_{q} C M K)\). Given these offsets and weights, the complexity of computing Equation 2 is \(O(N_{q} C^{2}+N_{q} K C^{2}+5 N_{q} K C)\)—the factor of 5 in \(5 N_{q} K C\) comes from bilinear interpolation and weighted summation in attention. Alternatively, \(W_{m}' x\) can be precomputed (as it is query-independent), reducing the complexity to \(O(N_{q} C^{2}+H W C^{2}+5 N_{q} K C)\). Thus, the overall complexity of deformable attention is \(O(N_{q} C^{2}+\min(H W C^{2}, N_{q} K C^{2})+5 N_{q} K C+3 N_{q} C M K)\). In our experiments, \(M=8\), \(K ≤4\), and \(C=256\) by default, so \(5 K+3 M K < C\), and the complexity simplifies to \(O(2 N_{q} C^{2}+\min(H W C^{2}, N_{q} K C^{2}))\).
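A quick numerical check of the final simplification, using the default hyperparameters stated above:

```python
# With the defaults M = 8, K = 4, C = 256, the 5K + 3MK factor is below C,
# so the 5*N_q*K*C and 3*N_q*C*M*K terms are dominated by N_q*C^2.
M, K, C = 8, 4, 256
print(5 * K + 3 * M * K, "<", C)   # 116 < 256
```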

A.2 CONSTRUCTING MULTI-SCALE FEATURE MAPS FOR DEFORMABLE DETR

As discussed in Section 4.1 and illustrated in Figure 4, the encoder's input multi-scale feature maps \(\{x^{l}\}_{l=1}^{L-1}\) (\(L=4\)) are extracted from the output of ResNet (He et al., 2016) stages \(C_{3}\) to \(C_{5}\), each transformed via a 1×1 convolution. The lowest-resolution feature map \(x^{L}\) (i.e., \(x^4\)) is obtained by applying a 3×3 stride-2 convolution to the final \(C_5\) stage (denoted as \(C_6\)). Notably, FPN (Lin et al., 2017a) is not used here: our multi-scale deformable attention module inherently enables information exchange across different feature scales, making FPN redundant.

![Figure 4: Constructing multi-scale feature maps for Deformable DETR.](Figure 4: Constructing multi-scale feature maps for Deformable DETR.)

| ResNet feature map | Transformation | Output multi-scale feature map (\(\{x^l\}_{l=1}^4\)) |
|---|---|---|
| \(C_3\) (H/8 × W/8 × 512) | 1×1 conv, stride 1 | \(x^1\) (H/8 × W/8 × 256) |
| \(C_4\) (H/16 × W/16 × 1024) | 1×1 conv, stride 1 | \(x^2\) (H/16 × W/16 × 256) |
| \(C_5\) (H/32 × W/32 × 2048) | 1×1 conv, stride 1 | \(x^3\) (H/32 × W/32 × 256) |
| \(C_5\) (H/32 × W/32 × 2048) | 3×3 conv, stride 2 | \(x^4\) (H/64 × W/64 × 256) |

A.3 BOUNDING BOX PREDICTION IN DEFORMABLE DETR

Since the multi-scale deformable attention module extracts image features around a reference point, we design the detection head to predict bounding boxes as relative offsets to this reference point, reducing optimization difficulty. The reference point \(\hat{p}_q = (\hat{p}_{qx}, \hat{p}_{qy})\) serves as the initial guess of the box center. The detection head predicts offsets relative to \(\hat{p}_q\), and the final normalized bounding box \(\hat{b}_q \in [0,1]^4\) is computed as:

\( \hat{b}_q = \left\{ \begin{array}{l} \sigma\left(b_{qx} + \sigma^{-1}(\hat{p}_{qx})\right), \\ \sigma\left(b_{qy} + \sigma^{-1}(\hat{p}_{qy})\right), \\ \sigma(b_{qw}), \\ \sigma(b_{qh}) \end{array} \right. \)

where \(b_{qx}, b_{qy}, b_{qw}, b_{qh} \in \mathbb{R}\) are the raw predictions from the detection head, and \(\sigma\) / \(\sigma^{-1}\) denote the sigmoid function and its inverse, respectively. This design ensures the learned decoder attention is strongly correlated with the predicted bounding boxes, accelerating training convergence.
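For concreteness, a small PyTorch sketch of this box parameterization; the `inverse_sigmoid` helper and the numeric values are illustrative assumptions.

```python
import torch

def inverse_sigmoid(x, eps=1e-5):
    """sigma^{-1}, clamped for numerical stability."""
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

# The head outputs raw values b_q; the center is an offset in inverse-sigmoid
# space anchored at the reference point, the size comes from plain sigmoids.
p_hat = torch.tensor([0.4, 0.7])                 # reference point (x, y), normalized
b_raw = torch.tensor([0.2, -0.1, -2.0, -1.5])    # (b_qx, b_qy, b_qw, b_qh)

center = torch.sigmoid(b_raw[:2] + inverse_sigmoid(p_hat))
size = torch.sigmoid(b_raw[2:])
box = torch.cat([center, size])                  # normalized (cx, cy, w, h)
print(box)
```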

A.4 MORE IMPLEMENTATION DETAILS

A.4.1 Iterative Bounding Box Refinement

Each decoder layer refines bounding boxes based on predictions from the previous layer. Let \(D\) be the number of decoder layers (e.g., \(D=6\)). For the \(d\)-th layer (\(d \in \{1,2,...,D\}\)), given the normalized bounding box \(\hat{b}_q^{d-1}\) from the \((d-1)\)-th layer, the refined box \(\hat{b}_q^d\) is computed as:

\( \hat{b}_q^d = \left\{ \begin{array}{l} \sigma\left(\Delta b_{qx}^d + \sigma^{-1}(\hat{b}_{qx}^{d-1})\right), \\ \sigma\left(\Delta b_{qy}^d + \sigma^{-1}(\hat{b}_{qy}^{d-1})\right), \\ \sigma\left(\Delta b_{qw}^d + \sigma^{-1}(\hat{b}_{qw}^{d-1})\right), \\ \sigma\left(\Delta b_{qh}^d + \sigma^{-1}(\hat{b}_{qh}^{d-1})\right) \end{array} \right. \)

where \(\Delta b_{qx}^d, \Delta b_{qy}^d, \Delta b_{qw}^d, \Delta b_{qh}^d \in \mathbb{R}\) are the offset predictions of the \(d\)-th layer. Prediction heads for different layers do not share parameters. The initial box \(\hat{b}_q^0\) is set to \(\hat{b}_{qx}^0 = \hat{p}_{qx}\), \(\hat{b}_{qy}^0 = \hat{p}_{qy}\), \(\hat{b}_{qw}^0 = 0.1\), \(\hat{b}_{qh}^0 = 0.1\)—the system is robust to variations in \(\hat{b}_{qw}^0\) and \(\hat{b}_{qh}^0\) (similar performance is achieved for values like 0.05, 0.2, or 0.5).

To stabilize training (Teed & Deng, 2020), gradients only backpropagate through \(\Delta b_{qx}^d, ..., \Delta b_{qh}^d\) and are blocked at \(\sigma^{-1}(\hat{b}_{qx}^{d-1}), ..., \sigma^{-1}(\hat{b}_{qh}^{d-1})\). Additionally, for the \(d\)-th layer’s cross-attention module (Equation 3), the reference point is updated to \((\hat{b}_{qx}^{d-1}, \hat{b}_{qy}^{d-1})\), and the sampling offset \(\Delta p_{mlqk}\) is modulated by the previous box size: \((\Delta p_{mlqkx} \cdot \hat{b}_{qw}^{d-1}, \Delta p_{mlqky} \cdot \hat{b}_{qh}^{d-1})\). This ensures sampling locations are aligned with the previous box’s center and size.
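A minimal sketch of one refinement step, showing the inverse-sigmoid update, the gradient blocking via `.detach()`, and the modulation of sampling offsets by the previous box size; all shapes are illustrative assumptions.

```python
import torch

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

# One refinement step at decoder layer d (300 queries, M = 8 heads, K = 4 points).
prev_box = torch.rand(300, 4, requires_grad=True)    # b̂_q^{d-1} as (cx, cy, w, h)
delta = torch.randn(300, 4)                          # Δb_q^d from the d-th layer's head

# gradients are blocked through the previous prediction via .detach()
new_box = torch.sigmoid(delta + inverse_sigmoid(prev_box.detach()))

# cross-attention of layer d: reference point = previous center, and the
# sampling offsets are modulated by the previous box width/height
sampling_offsets = torch.randn(300, 8, 4, 2)
modulated = sampling_offsets * prev_box.detach()[:, None, None, 2:]
reference = prev_box.detach()[:, :2]
print(new_box.shape, modulated.shape, reference.shape)
```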

A.4.2 Two-Stage Deformable DETR

The two-stage variant consists of a proposal generation stage and a refinement stage:

  1. Proposal Generation (First Stage):

We use an encoder-only Deformable DETR (decoder removed) to avoid the quadratic complexity of decoder self-attention. For each pixel in the multi-scale feature maps, a detection head (3-layer FFN for regression, linear projection for binary classification—foreground/background) predicts a bounding box. Let \(i\) be a pixel from feature level \(l_i \in \{1,...,L\}\) with normalized coordinates \(\hat{p}_i = (\hat{p}_{ix}, \hat{p}_{iy}) \in [0,1]^2\); its bounding box is computed as:

\( b_i = \left\{ \begin{array}{l} \sigma\left(\Delta b_{ix} + \sigma^{-1}(\hat{p}_{ix})\right), \\ \sigma\left(\Delta b_{iy} + \sigma^{-1}(\hat{p}_{iy})\right), \\ \sigma\left(\Delta b_{iw} + \sigma^{-1}(2^{l_i - 1} \cdot s)\right), \\ \sigma\left(\Delta b_{ih} + \sigma^{-1}(2^{l_i - 1} \cdot s)\right) \end{array} \right. \)

where \(s=0.05\) (base object scale), and \(\Delta b_{ix}, ..., \Delta b_{ih}\) are regression predictions. The Hungarian loss (used in DETR) is applied for training. Top-scoring boxes are selected as region proposals (no NMS is used).

  2. Refinement (Second Stage):

Region proposals from the first stage are fed into the decoder as initial boxes for iterative refinement. The positional embeddings of the decoder’s object queries are set to the positional embeddings of the proposal coordinates.

A.4.3 Initialization for Multi-scale Deformable Attention
  • Attention Heads: The number of attention heads \(M\) is set to 8. Matrices \(W_m' \in \mathbb{R}^{C_v \times C}\) and \(W_m \in \mathbb{R}^{C \times C_v}\) are randomly initialized.
  • Sampling Offsets & Weights: The linear projection layers for predicting \(A_{mlqk}\) (attention weights) and \(\Delta p_{mlqk}\) (sampling offsets) have weight parameters initialized to 0. Bias parameters are initialized to:
    • \(A_{mlqk} = \frac{1}{LK}\) (uniform attention across all sampling points at initialization).
    • \(\Delta p_{mlqk}\) set to pre-defined offsets (e.g., for \(K=4\), offsets are \((-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1)\) for 8 heads, scaled by \(k \in \{1,...,K\}\)).

For iterative bounding box refinement, the bias parameters for \(\Delta p_{mlqk}\) prediction are further scaled by \(\frac{1}{2K}\) to ensure initial sampling points lie within the boxes predicted by the previous decoder layer.
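A possible reading of this initialization in PyTorch is sketched below: the offset-projection weights start at zero, and its bias places head \(m\)'s \(K\) points along one of eight directions at radii 1 to \(K\), replicated across the \(L\) levels. The unit-circle direction grid is an assumption made for illustration; the released code builds an equivalent pattern.

```python
import math
import torch
import torch.nn as nn

# Offset-projection initialization: zero weights; biases arrange head m's K
# initial sampling points along the m-th of 8 directions at radii 1..K.
M, L, K = 8, 4, 4
offset_proj = nn.Linear(256, 2 * M * L * K)
nn.init.constant_(offset_proj.weight, 0.0)

angles = torch.arange(M, dtype=torch.float32) * (2 * math.pi / M)
directions = torch.stack([angles.cos(), angles.sin()], dim=-1)            # (M, 2)
grid = directions[:, None, None, :] * torch.arange(1, K + 1).view(1, 1, K, 1)
grid = grid.expand(M, L, K, 2)                                            # same per level
with torch.no_grad():
    offset_proj.bias.copy_(grid.reshape(-1))
```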

A.5 WHAT DEFORMABLE DETR LOOKS AT?

To analyze which image regions Deformable DETR relies on for detection, we visualize the gradient norm of final predictions (object center coordinates \(x/y\), box width/height \(w/h\), category score \(c\)) with respect to each input pixel. By Taylor’s theorem, the gradient norm reflects how much the output changes with pixel perturbations—indicating critical regions for prediction.

As shown in Fig. 5, Deformable DETR focuses on object extremities to determine bounding boxes (similar to DETR (Carion et al., 2020)):

  • Left/right boundaries for \(x\)-coordinate and width \(w\).
  • Top/bottom boundaries for \(y\)-coordinate and height \(h\).

Unlike DETR, Deformable DETR also attends to pixels inside the object to predict the category—this aligns with its improved ability to model object-level features via multi-scale deformable attention.

![Figure 5: Gradient norm of final detection outputs (center \(x/y\), width \(w\), height \(h\), category score \(c\)) with respect to input image pixels \(I\).](Figure 5: Gradient norm of final detection outputs (center \(x/y\), width \(w\), height \(h\), category score \(c\)) with respect to input image pixels \(I\).)

A.6 VISUALIZATION OF MULTI-SCALE DEFORMABLE ATTENTION

To understand the learned multi-scale deformable attention, we visualize sampling points and attention weights from the encoder’s last layer and decoder’s last layer (Fig. 6). Sampling points from different feature scales are overlaid on the input image, with circle size/color indicating attention weight magnitude.

Key observations:

  1. Encoder: Similar to DETR, the encoder already separates object instances via attention.
  2. Decoder: Unlike DETR (which focuses only on object extremities), Deformable DETR attends to the entire foreground instance, consistent with the gradient visualization in Fig. 5 (category prediction requires interior pixels).
  3. Multi-scale Adaptation: The module dynamically adjusts sampling points and weights based on object scale and shape (e.g., more points on small objects, wider coverage on large objects).

![Figure 6: Visualization of multi-scale deformable attention. (a) Encoder self-attention; (b) Decoder cross-attention. Green crosses = reference points; green rectangles = predicted boxes; colored circles = sampling points (size/color = attention weight).](Figure 6: Visualization of multi-scale deformable attention. (a) Encoder self-attention; (b) Decoder cross-attention. Green crosses = reference points; green rectangles = predicted boxes; colored circles = sampling points (size/color = attention weight).)

A.7 NOTATIONS

Table 4 defines key notations used in the paper for clarity:

| Notation | Description |
|---|---|
| \(m\) | Index of the attention head |
| \(l\) | Index of the feature level of the key elements |
| \(q\) | Index of the query element |
| \(k\) | Index of the key element |
| \(N_q\) | Number of query elements |
| \(N_k\) | Number of key elements |
| \(M\) | Total number of attention heads |
| \(L\) | Total number of input feature levels |
| \(K\) | Number of sampled keys per attention head per feature level |
| \(C\) | Input feature dimension |
| \(C_v\) | Feature dimension per attention head (\(C_v = C/M\) by default) |
| \(H, W\) | Height and width of a single-scale feature map |
| \(H_l, W_l\) | Height and width of the \(l\)-th level feature map |
| \(A_{mqk}\) | Attention weight of the \(q\)-th query on the \(k\)-th key (\(m\)-th head) |
| \(A_{mlqk}\) | Attention weight of the \(q\)-th query on the \(k\)-th key (\(m\)-th head, \(l\)-th level) |
| \(z_q\) | Feature of the \(q\)-th query element |
| \(p_q\) | 2-d coordinate of the reference point for the \(q\)-th query |
| \(\hat{p}_q\) | Normalized 2-d coordinate of the reference point for the \(q\)-th query |
| \(x\) | Single-scale input feature map (features of the key elements) |
| \(x_k\) | Feature of the \(k\)-th key element |
| \(x^l\) | \(l\)-th level of the multi-scale feature maps |
| \(\Delta p_{mqk}\) | Sampling offset of the \(q\)-th query for the \(k\)-th key (\(m\)-th head) |
| \(\Delta p_{mlqk}\) | Sampling offset of the \(q\)-th query for the \(k\)-th key (\(m\)-th head, \(l\)-th level) |
| \(W_m\) | Output projection matrix of the \(m\)-th attention head |
| \(U_m\) | Query projection matrix of the \(m\)-th attention head |
| \(V_m\) | Key projection matrix of the \(m\)-th attention head |
| \(W_m'\) | Value projection matrix of the \(m\)-th attention head |
| \(\phi_l(\hat{p})\) | Function mapping the normalized coordinate \(\hat{p}\) onto the \(l\)-th level feature map |
| \(\sigma\) | Sigmoid function |
| \(\sigma^{-1}\) | Inverse sigmoid function |

Deformable DETR Paper Deep Dive: Deformable Transformers for End-to-End Object Detection

Deformable DETR is a classic object detection paper published at ICLR 2021. Its goal is to fix the core problems of the original DETR (the Transformer-based end-to-end detector): slow convergence, weak small-object detection, and high computational complexity. By fusing the sparse sampling ability of deformable convolution with the relation modeling ability of Transformers, it proposes the deformable attention module and ultimately delivers end-to-end detection that converges faster, runs more efficiently, and performs better.

I. Background: Why Do We Need Deformable DETR?

To understand Deformable DETR, we first need to be clear about the core pain points it targets: the limitations of the original DETR and the bottlenecks of traditional object detection.

1. Pain Points of Traditional Object Detection: Reliance on Hand-Crafted Components

Before DETR, mainstream detectors (e.g., Faster R-CNN, YOLO) relied on many hand-crafted components and could not be truly end-to-end:

  • Anchor generation: large numbers of candidate boxes (anchors of various sizes and aspect ratios) are pre-defined and require experience-based tuning;
  • Rule-based label assignment: hand-designed rules (e.g., IoU thresholds) match anchors to ground-truth boxes to determine training targets;
  • Non-maximum suppression (NMS): duplicate boxes must be filtered after detection, a step that cannot be folded into training.

2. The Breakthrough and Limitations of the Original DETR

DETR (End-to-End Object Detection with Transformers, 2020) achieved fully end-to-end object detection for the first time:

  • Breakthrough: it replaces the hand-crafted components with a Transformer encoder-decoder and directly predicts the set of object boxes via a set-based matching loss (the Hungarian loss), with no anchors and no NMS;
  • Limitations: because of how Transformer attention works, DETR has two key problems that make it hard to deploy:
    • Extremely slow convergence: it needs 500 training epochs on COCO, 10-20× more than Faster R-CNN;
    • Weak small-object detection: high-resolution feature maps would help small objects, but the cost of Transformer attention grows quadratically with the number of pixels (a 100×100 feature map has 4× the pixels of a 50×50 one and thus roughly 16× the attention cost), so high-resolution features are out of reach;
    • Root cause: at initialization, Transformer attention spreads nearly uniform weights over all pixels and needs a long schedule to focus on meaningful regions, and attending to every pixel makes the computation explode.

3. Inspiration from Deformable Convolution

Deformable convolution (Dai et al., 2017) is a mechanism that sparsely attends to key regions: it predicts sampling offsets and samples only a few points near the object instead of traversing the whole image. Its drawback is the lack of relation modeling (it cannot capture the relationships between objects the way a Transformer can), which is exactly what makes DETR work.

Therefore, the core idea of Deformable DETR is to fuse the sparse sampling of deformable convolution with the relation modeling of Transformers into a more efficient attention mechanism.

II. Core Innovation: The Deformable Attention Module

This module is the heart of Deformable DETR and directly solves the "attend to every pixel" problem of Transformer attention. Its core idea: for each query, attend only to a small number of key sampling points around a reference point instead of all pixels.

1. Single-Scale Deformable Attention (Basic Version)

Core logic

Given a query feature \(z_q\) (e.g., an encoder pixel or a decoder object query) and a reference point \(p_q\) (the spatial location associated with the query), attention samples only \(K\) points around \(p_q\) (\(K \ll H \times W\), with \(K=4\) by default in the paper) and computes a weighted sum of the features at those points.

Mathematical formulation

\( DeformAttn(z_q, p_q, x) = \sum_{m=1}^M W_m \left[ \sum_{k=1}^K A_{mqk} \cdot W_m' x(p_q + \Delta p_{mqk}) \right] \)

Meaning of the symbols:

  • \(m\): attention head index (default \(M=8\); multiple heads capture features from different subspaces);
  • \(A_{mqk}\): attention weight of the \(q\)-th query on the \(k\)-th sampling point in head \(m\) (softmax-normalized so the weights sum to 1);
  • \(\Delta p_{mqk}\): offset of the \(k\)-th sampling point of the \(q\)-th query in head \(m\), relative to the reference point \(p_q\) (predicted from the query feature \(z_q\));
  • \(x(p_q + \Delta p_{mqk})\): the feature at "reference point + offset", obtained via bilinear interpolation (handling fractional coordinates);
  • \(W_m', W_m\): learnable projection matrices that adjust the feature dimensions.

Key advantage: linear complexity

Transformer attention costs \(O(N_q N_k C)\) (\(N_q\) queries, \(N_k\) keys, i.e., pixels), which grows quadratically with the feature map size;

Deformable attention costs \(O(N_q K C)\) (only \(K\) points are sampled). With \(K\) fixed (e.g., 4), the cost is independent of the number of pixels \(N_k\), which removes the computational bottleneck on high-resolution feature maps.
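A back-of-the-envelope comparison of the two complexity terms (the feature-map size is an illustrative assumption):

```python
# 100x100 feature map (N_q = N_k = 10,000 pixels), C = 256 channels, K = 4 points.
N, C, K = 100 * 100, 256, 4
full_attn = N * N * C        # O(N_q N_k C): quadratic in the number of pixels
deform_attn = N * K * C      # O(N_q K C): linear in the number of queries
print(f"{full_attn:,} vs {deform_attn:,} (ratio {full_attn // deform_attn}x)")
```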

2. Multi-Scale Deformable Attention (Extended Version)

The key to small-object detection is multi-scale features (small objects are clearer on high-resolution maps, while large objects are better captured on low-resolution maps). Traditional methods such as FPN need hand-designed top-down / bottom-up fusion paths, whereas Deformable DETR bakes multi-scale sampling directly into the attention module, with no FPN required.

Core logic

For each query, sampling is performed not only around the reference point on a single scale but also around the corresponding locations on all of the multi-scale feature maps, and the multi-scale information is aggregated.

Mathematical formulation

\( MSDeformAttn(z_q, \hat{p}_q, \{x^l\}) = \sum_{m=1}^M W_m \left[ \sum_{l=1}^L \sum_{k=1}^K A_{mlqk} \cdot W_m' x^l(\phi_l(\hat{p}_q) + \Delta p_{mlqk}) \right] \)

Meaning of the additional symbols:

  • \(l\): feature-scale index (the paper uses \(L=4\), i.e., four feature-map scales);
  • \(\hat{p}_q\): normalized reference point (coordinates in [0, 1], so positions correspond across scales);
  • \(\phi_l(\hat{p}_q)\): maps the normalized reference point to actual coordinates on the \(l\)-th scale feature map (e.g., 0.5 maps to the middle of level \(l\));
  • \(A_{mlqk}\): multi-scale attention weight (normalized over all sampling points of all scales, summing to 1).

Key advantage: native multi-scale fusion

Without FPN's hand-designed paths, the attention module samples directly from multi-scale features and naturally captures both high-resolution detail for small objects and low-resolution global context for large objects. This is the main source of Deformable DETR's improvement on small objects.

III. Overall Architecture of Deformable DETR

Built on the deformable attention module, Deformable DETR rebuilds DETR's encoder and decoder; the overall architecture is shown in Fig. 1:

![Deformable DETR architecture](Figure 1: Illustration of the proposed Deformable DETR object detector.)

1. Encoder: Processing Multi-Scale Features

  • Input: four scales of feature maps extracted from the ResNet backbone (\(C_3\) to \(C_6\), with channels unified to 256);
  • Core operation: multi-scale deformable self-attention replaces the full Transformer attention;
    • Self-attention: queries and keys are both pixels of the multi-scale feature maps, and each query pixel's reference point is itself;
    • Scale embedding: a scale-level embedding is added to the features of each level so the model can tell the scales apart;
  • Output: multi-scale feature maps refined by several rounds of attention, in which object instances are already roughly separated.

2. Decoder: Predicting Object Boxes

  • Input: the encoder's multi-scale feature maps plus 300 object queries (more than DETR's 100, improving recall);
  • Core operations:
    • Deformable cross-attention: multi-scale deformable cross-attention replaces the Transformer cross-attention (queries are the object queries, keys are the encoder's multi-scale features);
      • Reference-point prediction: each object query predicts a normalized reference point via a linear projection, used as the initial sampling location;
    • Transformer self-attention: the self-attention among object queries is kept unchanged, capturing relations between objects (e.g., the relative position of a "person" and a "bicycle");
  • Output: after the decoder, the object queries pass through a detection head that predicts boxes (regression branch) and classes (classification branch).

3. Additional Improvements

To further improve detection accuracy, the paper introduces two practical enhancements:

(1) Iterative bounding box refinement

Inspired by iterative optimization in optical flow estimation: each decoder layer fine-tunes the box predicted by the previous layer instead of predicting the final box in one shot.

  • Logic: the \(d\)-th decoder layer predicts an offset relative to the box from layer \(d-1\); the offset is applied in inverse-sigmoid space and mapped back to normalized coordinates to update the box position and size;
  • Benefit: errors are corrected step by step, avoiding the bias of a single-shot prediction and improving localization accuracy.

(2) Two-stage Deformable DETR

In the original DETR, the object queries are unrelated to the input image (they are learned embeddings). The two-stage variant improves recall by first generating proposals and then refining them:

  • Stage 1 (proposal generation): an encoder-only Deformable DETR treats every pixel as a query and directly predicts a box; high-scoring boxes are kept as proposals (no NMS);
  • Stage 2 (box refinement): the proposals initialize the decoder's object queries and are further improved through iterative refinement;
  • Benefit: the proposals are strongly tied to the image content, which shrinks the decoder's search space and improves recall on small and occluded objects.

IV. Experimental Results: How Good Is It?

The paper runs comprehensive experiments on COCO 2017; the takeaways can be summarized as "faster, more accurate, more efficient".

1. Comparison with DETR: 10× Faster Convergence, Better Accuracy

| Method | Training epochs | COCO AP | Small-object AP (APS) | Large-object AP (APL) |
|---|---|---|---|---|
| DETR-DC5 | 500 | 43.8 | 26.5 | 58.1 |
| Deformable DETR | 50 | 44.9 | 27.7 | 58.9 |
| + iterative refinement | 50 | 45.3 | 28.3 | 59.2 |
| + two-stage | 50 | 45.5 | 28.8 | 59.5 |

Key takeaways:

  • Convergence: with only 50 epochs (1/10 of DETR's schedule), Deformable DETR already surpasses DETR trained for 500 epochs;
  • Small objects: APS improves from 26.5 to 28.8, thanks to multi-scale deformable attention;
  • Efficiency: the FLOPs are on par with DETR, while inference is 1.6× faster (less memory-access overhead than Transformer attention).

2. Comparison with Mainstream Detectors: State-of-the-Art Accuracy

On the COCO test-dev set, Deformable DETR matches the top methods of its time (e.g., EfficientDet-D7, ATSS) while keeping the end-to-end advantage:

| Method | Backbone | TTA | COCO AP | Small-object AP (APS) |
|---|---|---|---|---|
| EfficientDet-D7 | EfficientNet-B6 | | 52.2 | - |
| ATSS | ResNeXt-101 + DCN | | 50.7 | 33.2 |
| Deformable DETR | ResNeXt-101 + DCN | ✓ | 52.3 | 34.4 |

Key takeaway: with a ResNeXt-101 + DCN backbone and test-time augmentation, Deformable DETR reaches 52.3 AP, surpassing EfficientDet-D7, and its small-object AP (34.4) is higher than that of ATSS.

3. Ablation Study: Validating the Core Modules

The ablation experiments confirm that deformable attention and multi-scale sampling are both necessary:

| Configuration | AP | Small-object AP (APS) |
|---|---|---|
| Deformable convolution only (no attention) | 39.7 | 21.2 |
| Single-scale deformable attention | 41.4 | 24.1 |
| Multi-scale inputs + single-scale attention | 42.3 | 24.8 |
| Multi-scale deformable attention | 43.8 | 26.4 |

Conclusion: multi-scale deformable attention is the main driver of the gains; compared with plain deformable convolution, it improves AP by 4.1 points and small-object AP by 5.2 points.

V. Summary and Impact

The core contribution of Deformable DETR is not just better DETR performance; it also establishes a new paradigm for efficient attention mechanisms:

  1. It removes DETR's deployment bottlenecks: training drops from 500 epochs to 50 and high-resolution feature maps become affordable, making end-to-end detectors far more practical;
  2. It simplifies multi-scale fusion: deformable attention supports multi-scale features natively, with no need for hand-crafted components such as FPN, pushing the end-to-end idea further;
  3. It inspired follow-up research: later work such as DAB-DETR and Deformable DETR v2 extends deformable attention, forming a research line of deformable Transformer detection.

In short, Deformable DETR is a milestone for efficient end-to-end object detection and remains one of the standard base models in both industry and academia.
