【R-FCN】《R-FCN: Object Detection via Region-based Fully Convolutional Networks》

Article link: https://blog.youkuaiyun.com/bryant_meng/article/details/81945236

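This post covers R-FCN, a fully convolutional object detection method that uses position-sensitive score maps to improve detection speed and accuracy, running 2.5-20× faster than Faster R-CNN.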



NIPS-2016

1 Motivation

Faster R-CNN is a two-stage detector: a fully convolutional subnetwork followed by an RoI-wise subnetwork. It therefore applies a costly per-region subnetwork hundreds of times.

To speed things up further, the authors make the detector fully convolutional with almost all computation shared on the entire image, which cuts the computation time of the head.

However, simply replacing the fully connected layers (AlexNet, VGG) with a fully convolutional design (ResNet, GoogLeNet) leads to inferior detection accuracy that does not match the network's superior classification accuracy. The reason is a conflict: the image-level classification task favors translation invariance, whereas the object detection task needs localization representations that are translation-variant to an extent.

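The ResNet detector creates a deeper RoI-wise subnetwork. This improves accuracy, but speed drops due to the unshared per-RoI computation. (The deeper a network is, the less sensitive it is to position, so the ResNet detector moves the RoI pooling operation earlier, placing it between conv stages rather than after the last conv layer, so that the head retains more position-sensitive information. The price is more computation in the head, because that computation is not shared: the parameters are shared, but each RoI is processed independently, and since RoIs overlap, the computation on the overlapping regions is effectively repeated.)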

The goal of the paper is to resolve this conflict.

2 Innovation

Building on Faster R-CNN, the authors propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection.

3 Advantages

region-based detector is fully convolutional with almost all computation shared on the entire image
2.5-20× faster than the Faster R-CNN counterpart
83.6% mAP on PASCAL VOC 2007, 82.0% on PASCAL VOC 2012

4 Methods

In Faster R-CNN the head computation is not shared across RoIs. Here the head computation is shared.

4.1 Structures

Backbone: ResNet-101 = convolutional layers (100) + average pooling + 1000-class fc layer. The authors use only the 100 convolutional layers to extract the feature map.
Position-sensitive score maps & Position-sensitive RoI pooling

The last convolutional layer is designed to have $k^2(C+1)$ channels, where $C$ is the number of object categories (the $+1$ is for background).

Each RoI on the score maps is divided into $k \times k$ bins.

The position-sensitive RoI pooling operation is:

$$r_c(i, j \mid \Theta) = \sum_{(x, y) \in \text{bin}(i, j)} z_{i,j,c}(x + x_0, y + y_0 \mid \Theta) / n$$

$r_c(i, j)$ is the pooled response in the $(i, j)$-th bin for the $c$-th category
$\Theta$ denotes all learnable parameters of the network
$z_{i,j,c}$ is one score map out of the $k^2(C+1)$ score maps
$(x_0, y_0)$ denotes the top-left corner of an RoI
$n$ is the number of pixels in the bin

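Note that the pooled $(i, j)$-th bin for class $c$ is obtained by averaging only over the $(i, j)$-th bin of the $(i, j, c)$-th score map. In other words, within each class, every score map contributes just one bin. For example, the white score map is pooled only over the pixels in the bottom-right part of the RoI.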
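The pooling rule above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `psroi_pool`, the nested-list layout `score_maps[channel][y][x]`, and the channel ordering `(i*k + j)*(C+1) + c` are assumptions made for this example. The bin boundaries follow the floor/ceil convention given in the paper.

```python
import math


def psroi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive average pooling for a single RoI.

    score_maps: nested list indexed [channel][y][x], with k*k*(num_classes+1)
                channels. Channel order is an assumption of this sketch:
                channel = (i*k + j)*(num_classes+1) + c.
    roi:        (x0, y0, w, h) in score-map coordinates; the caller must make
                sure the RoI lies inside the score maps.
    Returns pooled[i][j][c], the response of bin (i, j) for class c.
    """
    x0, y0, w, h = roi
    c1 = num_classes + 1  # +1 for background
    pooled = [[[0.0] * c1 for _ in range(k)] for _ in range(k)]
    for i in range(k):          # i indexes bins along x
        for j in range(k):      # j indexes bins along y
            xs = range(math.floor(i * w / k), math.ceil((i + 1) * w / k))
            ys = range(math.floor(j * h / k), math.ceil((j + 1) * h / k))
            n = len(xs) * len(ys)  # number of pixels in the bin
            for c in range(c1):
                # Only the score map dedicated to (i, j, c) is read here.
                z = score_maps[(i * k + j) * c1 + c]
                total = sum(z[y + y0][x + x0] for y in ys for x in xs)
                pooled[i][j][c] = total / n
    return pooled
```

With `k = 1` the single bin covers the whole RoI, so the function reduces to global average pooling within the RoI.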

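The authors address bbox regression in a similar way: a sibling conv layer with $4k^2$ channels is applied to the feature map to produce $4k^2$ score maps, position-sensitive RoI pooling then produces a $4k^2$-d vector for each RoI, and voting aggregates it into a 4-d vector representing $(t_x, t_y, t_w, t_h)$. This is the class-agnostic variant; the class-specific variant simply uses $4k^2 C$ channels.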
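As a rough sketch of the voting step for box regression, the pooled $k \times k \times 4$ responses can be averaged into one 4-d offset vector. The helper names `vote_bbox` and `apply_deltas` are hypothetical, and the decoding step assumes the usual R-CNN style parameterization with boxes given as (center x, center y, width, height).

```python
import math


def vote_bbox(pooled):
    """Average a k x k x 4 pooled response into (t_x, t_y, t_w, t_h).

    pooled[i][j] is the 4-d response of bin (i, j) for one RoI.
    """
    bins = [b for row in pooled for b in row]
    return tuple(sum(b[d] for b in bins) / len(bins) for d in range(4))


def apply_deltas(roi, t):
    """Decode offsets into a box (assumed R-CNN style parameterization).

    roi: (cx, cy, w, h) of the proposal; t: (t_x, t_y, t_w, t_h).
    """
    cx, cy, w, h = roi
    tx, ty, tw, th = t
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))
```

For the class-specific variant, the same vote would be applied once per class to a $k \times k \times 4$ slice of the $4k^2 C$-channel output.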

4.2 Training

End to End
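Loss function: same as Fast R-CNN (see 【Fast RCNN】《Fast-RCNN》)

The classification loss and the regression loss are combined through a balancing weight λ (set to 1 in the paper), and $L_{reg}$ again uses the smooth L1 loss.
Positives: RoIs with IoU of at least 0.5 with a ground-truth box; negatives otherwise
OHEM (online hard example mining)
Single-scale training: images are resized so that the shorter side is 600 pixels
4-step training, as in Faster R-CNN (see 【Faster RCNN】《Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks》), except that the alternation is between training RPN and training R-FCN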
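The per-RoI loss and the OHEM selection can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: class index 0 is taken to be background, `roi_loss` and `ohem_select` are hypothetical names, and `batch_size` (the number of hard RoIs kept for backpropagation) is a free parameter of the example.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def roi_loss(class_scores, gt_class, t, t_star, lam=1.0):
    """Cross-entropy on the voted class scores plus lam * box loss.

    The box term is only counted for non-background RoIs (gt_class > 0).
    """
    m = max(class_scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in class_scores))
    l_cls = log_z - class_scores[gt_class]  # -log softmax(gt_class)
    l_reg = 0.0
    if gt_class > 0:
        l_reg = sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return l_cls + lam * l_reg


def ohem_select(losses, batch_size):
    """Return indices of the batch_size RoIs with the highest loss."""
    order = sorted(range(len(losses)), key=lambda r: losses[r], reverse=True)
    return order[:batch_size]
```

Because the per-RoI computation in R-FCN is cheap, evaluating the loss for all proposals before selecting the hard ones adds little training time.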

4.3 Inference

300 RoIs per image
Non-maximum suppression (NMS) with an IoU threshold of 0.3
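For reference, greedy NMS with the 0.3 threshold can be written in a few lines. This is a generic sketch of standard NMS, not code from the paper; boxes are assumed to be `(x1, y1, x2, y2)` tuples and the function names are chosen for this example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

In a detector this would typically be run per class on the scored boxes that survive from the 300 RoIs.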

4.4 À trous and stride

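In ResNet-101, stages 1-4 are left unchanged. In stage 5, the first conv with stride = 2 is changed to stride = 1, and à trous (dilated) convolution is used to compensate for the reduced stride.

This reduces the effective stride of ResNet from 32 to 16, which increases the resolution of the score maps.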
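The stride arithmetic can be checked with a small worked example. The per-stage strides below are the standard ResNet values (a factor of 2 per stage), and `feature_map_size` uses a simple ceiling division as an approximation; exact sizes depend on padding details that this sketch ignores.

```python
import math


def total_stride(stage_strides):
    """Multiply the per-stage strides to get the effective stride."""
    s = 1
    for st in stage_strides:
        s *= st
    return s


def feature_map_size(height, width, stride):
    """Approximate output size for a given effective stride."""
    return math.ceil(height / stride), math.ceil(width / stride)


original = [2, 2, 2, 2, 2]  # stages 1-5, each downsampling by 2
modified = [2, 2, 2, 2, 1]  # stage 5 stride changed from 2 to 1

stride_before = total_stride(original)  # 32
stride_after = total_stride(modified)   # 16

# Score-map size for an image whose shorter side is 600 pixels
# (the 1000-pixel longer side is just an example value).
size_before = feature_map_size(600, 1000, stride_before)
size_after = feature_map_size(600, 1000, stride_after)
```

Halving the stride doubles the score-map resolution in each dimension, so each of the $k \times k$ bins of an RoI covers more score-map pixels.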

The à trous trick improves mAP by 2.6 points.


4.5 Visualization

If the response of every bin is high, the score after voting is high; otherwise it is low.
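A toy example makes the voting behavior concrete. It assumes voting by averaging the $k \times k$ bin responses followed by a softmax across classes; the response values are made up for illustration and the function names are hypothetical.

```python
import math


def vote(pooled_bins):
    """Average the k x k bin responses of one class into a single score."""
    flat = [v for row in pooled_bins for v in row]
    return sum(flat) / len(flat)


def softmax(scores):
    """Turn per-class voted scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# RoI that overlaps the object well: all 3 x 3 bins respond strongly.
aligned = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
# Shifted RoI: only some bins see the object part they were trained for.
shifted = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

aligned_score = vote(aligned)
shifted_score = vote(shifted)
```

Here `aligned_score` is larger than `shifted_score`, which mirrors the visualization: a box that is offset from the object activates fewer bins and receives a lower score.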

5 Experiments

5.1 Experiments on PASCAL VOC

5.1.1 Verifying the effect of the architecture

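Training: VOC 07+12
Test: VOC 07
Standard Faster R-CNN: 76.4% mAP, with RoI pooling inserted between stage 4 and stage 5 (see the Faster RCNN-RES101 architecture diagram at the end of the post)
Naive Faster R-CNN: RoI pooling is applied after conv5, and an inexpensive 21-class fc layer is evaluated on each RoI; for a fair comparison, the à trous trick is used
Class-specific RPN: the binary (object or not) classification of RPN is replaced by 21-class classification; the à trous trick is used as well
R-FCN (without position-sensitivity): k = 1, which is equivalent to global pooling within each RoI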