TensorFlow Object Detection API Source Code Reading Notes: R-FCN

License: CC 4.0 BY-SA

Tags: #TensorFlow #R-FCN #faster-r-cnn #deep-learning #object-detection

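Building on Faster R-CNN, this article takes a close look at the TensorFlow implementation of the R-FCN object detection algorithm, walks through the source code in detail, and mentions the use of deformable R-FCN in MXNet.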

With the Faster R-CNN groundwork from the earlier notes, R-FCN is fairly easy to follow.

"""object_detection/meta_architectures/rfcn_meta_arch.py
The R-FCN meta architecture is similar to Faster R-CNN and only differs in the
second stage. Hence this class inherits FasterRCNNMetaArch and overrides only
the `_predict_second_stage` method
"""

The part that changes the most is:
    box_predictions = self._rfcn_box_predictor.predict(
        box_classifier_features,
        num_predictions_per_location=1,
        scope=self.second_stage_box_predictor_scope,
        proposal_boxes=proposal_boxes_normalized)
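So little changes because the code implementation does not follow the original paper exactly; see the authors' paper, Section 3.4, "Training and hyperparameter tuning".
Also note that in Faster R-CNN, ROI pooling comes first and the second-stage feature extractor runs afterwards. In R-FCN, second-stage feature extraction runs first and the result then goes into RfcnBoxPredictor. This is exactly where R-FCN improves on Faster R-CNN: as much convolution as possible is shared across the ROIs, so the (higher-level) ROIs in R-FCN pass through fewer per-ROI prediction layers, which greatly improves efficiency.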
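The efficiency argument can be made concrete with a rough cost model. This is only a sketch: the function names and every cost figure below are hypothetical units chosen to show how the two designs scale with the number of ROIs; they are not measurements of the actual models.

```python
def faster_rcnn_cost(num_rois, shared_cost, head_cost_per_roi):
    """Faster R-CNN: the second-stage extractor runs once per pooled ROI."""
    return shared_cost + num_rois * head_cost_per_roi


def rfcn_cost(num_rois, shared_cost, head_cost_full_map, pool_cost_per_roi):
    """R-FCN: the second-stage extractor runs once on the whole feature map;
    each ROI then only pays for a cheap position-sensitive pooling step."""
    return shared_cost + head_cost_full_map + num_rois * pool_cost_per_roi


# Hypothetical cost units, for illustration only.
frcnn = faster_rcnn_cost(num_rois=300, shared_cost=100, head_cost_per_roi=10)
rfcn = rfcn_cost(num_rois=300, shared_cost=100, head_cost_full_map=50,
                 pool_cost_per_roi=1)
print(frcnn, rfcn)
```

Under these assumed numbers the per-ROI term dominates Faster R-CNN, while in R-FCN it is reduced to the pooling cost.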

"""First, a quick review of Faster R-CNN: ROI pooling"""
The implementation is in
  def _compute_second_stage_input_feature_maps(self, features_to_crop,
                                               proposal_boxes_normalized)
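An ROI is a crop of features_to_crop (the feature map produced by the first-stage feature extractor), cut out according to the proposal boxes (proposal_boxes_normalized). ROI pooling is a special case of the SPP layer: it maps a feature map of arbitrary size to a fixed size using adaptively sized pooling bins, so that the result can be fed to the fc layers. The TF code authors do not implement it this way. Instead, they first resize ROIs of different sizes to a uniform size, after which ROI pooling is no longer needed (see _compute_second_stage_input_feature_maps). In other words, ROI pooling is needed only because ROIs come in different sizes. (The R-FCN paper's perspective: this region-specific operation breaks down translation invariance, and the post-RoI convolutional layers are no longer translation-invariant when evaluated across different regions.)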
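The "resize every ROI to one size" idea can be sketched in NumPy. This is a minimal illustration and not the API's code: the `crop_and_resize` helper below is a hypothetical, single-box re-implementation of bilinear resampling, and it assumes normalized boxes in `[y1, x1, y2, x2]` order lying inside the image and a crop size of at least 2.

```python
import numpy as np


def crop_and_resize(feature_map, box, crop_size):
    """Bilinearly resample one normalized box [y1, x1, y2, x2] from an
    (H, W, C) feature map into a (crop_size, crop_size, C) crop."""
    height, width, _ = feature_map.shape
    y1, x1, y2, x2 = box
    # Sample positions in pixel coordinates, evenly spaced across the box.
    ys = np.linspace(y1 * (height - 1), y2 * (height - 1), crop_size)
    xs = np.linspace(x1 * (width - 1), x2 * (width - 1), crop_size)
    # Integer neighbours on each side of every sample position.
    y_lo = np.clip(np.floor(ys).astype(int), 0, height - 1)
    x_lo = np.clip(np.floor(xs).astype(int), 0, width - 1)
    y_hi = np.clip(y_lo + 1, 0, height - 1)
    x_hi = np.clip(x_lo + 1, 0, width - 1)
    # Fractional offsets, shaped to broadcast over (rows, cols, channels).
    wy = (ys - y_lo)[:, None, None]
    wx = (xs - x_lo)[None, :, None]
    top = feature_map[y_lo][:, x_lo] * (1 - wx) + feature_map[y_lo][:, x_hi] * wx
    bottom = feature_map[y_hi][:, x_lo] * (1 - wx) + feature_map[y_hi][:, x_hi] * wx
    return top * (1 - wy) + bottom * wy


features = np.arange(4 * 6 * 2, dtype=float).reshape(4, 6, 2)
rois = [(0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 1.0, 1.0)]  # two ROIs of different size
crops = np.stack([crop_and_resize(features, roi, 7) for roi in rois])
print(crops.shape)  # (2, 7, 7, 2)
```

Because every crop comes out with the same spatial size, the crops can be stacked into one batch and passed to the next stage without an adaptive pooling layer.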

[Figure: illustration from the R-FCN paper, not preserved here]

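"""The basic idea of R-FCN was analyzed and compared above; a careful look at the three large figures in the R-FCN paper makes clear what position-sensitive means. Why is position-sensitive RoI pooling needed? The paper says it is there to incorporate translation variance into the FCN. Judging from how the position-sensitive score maps are generated, though, this looks like explicitly introducing a kind of "prior structure", as in Fig. 3 above: how can we be sure that these nine score maps correspond exactly to the nine positions? (For example, the “top-center-sensitive” score map exhibits high scores roughly near the top-center position of an object.)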
position-sensitive score maps
position-sensitive RoI pooling layer
"""

The implementation in the TensorFlow Object Detection API is somewhat different:
class RfcnBoxPredictor(BoxPredictor)
Applies a position sensitive ROI pooling on position sensitive feature maps to
  predict classes and refined locations

ops.position_sensitive_crop_regions
  """Position-sensitive crop and pool rectangular regions from a feature grid.
  The output crops are split into `spatial_bins_y` vertical bins
  and `spatial_bins_x` horizontal bins. For each intersection of a vertical
  and a horizontal bin the output values are gathered by performing
  `tf.image.crop_and_resize` (bilinear resampling) on a separate subset of
  channels of the image. This reduces `depth` by a factor of
  `(spatial_bins_y * spatial_bins_x)`.
  When global_pool is True, this function implements a differentiable version
  of position-sensitive RoI pooling used in
  [R-FCN detection system](https://arxiv.org/abs/1605.06409).
  When global_pool is False, this function implements a differentiable version
  of position-sensitive assembling operation used in
  [instance FCN](https://arxiv.org/abs/1603.08678)."""
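The docstring above can be illustrated with a small NumPy sketch. Note the assumptions: this follows the average-pooling formulation from the R-FCN paper rather than the `tf.image.crop_and_resize` based version in the API, the `ps_roi_pool` helper is hypothetical, channels are assumed to be laid out group by group (one group per bin), and the ROI is given in integer pixel coordinates with at least `k` pixels on each side.

```python
import numpy as np


def ps_roi_pool(score_maps, roi, k, global_pool=True):
    """Position-sensitive RoI pooling (average-pooling variant) for one RoI.

    score_maps: (H, W, k * k * depth); channel group g = i * k + j serves bin (i, j).
    roi: integer pixel box (y1, x1, y2, x2), end-exclusive.
    """
    depth = score_maps.shape[-1] // (k * k)
    y1, x1, y2, x2 = roi
    # Bin edges along each axis of the RoI.
    ys = np.linspace(y1, y2, k + 1).round().astype(int)
    xs = np.linspace(x1, x2, k + 1).round().astype(int)
    out = np.zeros((k, k, depth))
    for i in range(k):
        for j in range(k):
            group = i * k + j
            channels = slice(group * depth, (group + 1) * depth)
            # Bin (i, j) reads only from its own channel group.
            region = score_maps[ys[i]:ys[i + 1], xs[j]:xs[j + 1], channels]
            out[i, j] = region.mean(axis=(0, 1))
    # global_pool=True averages the k * k bins into one vector per RoI (voting).
    return out.mean(axis=(0, 1)) if global_pool else out


k, depth = 3, 2
# Toy score maps: every channel in group g holds the constant value g.
score_maps = np.repeat(np.arange(k * k, dtype=float), depth) * np.ones((9, 9, 1))
bins = ps_roi_pool(score_maps, (0, 0, 9, 9), k, global_pool=False)
votes = ps_roi_pool(score_maps, (0, 0, 9, 9), k, global_pool=True)
print(bins[..., 0])
print(votes)
```

With the toy input, bin (i, j) returns the value of group `i * k + j`, which shows that each spatial bin draws from a different subset of channels and that the output depth shrinks by a factor of `k * k`.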

'''
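As far as I can tell, the features from the second-stage extractor are expanded with a 1*1 convolution to k*k*(c+1) channels, which gives the position-sensitive score maps. The snippet below is the box-regression branch, whose depth is k*k*num_classes*box_code_size rather than k*k*(c+1):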
      location_feature_map_depth = (self._num_spatial_bins[0] *
                                    self._num_spatial_bins[1] *
                                    self.num_classes *
                                    self._box_code_size)
      location_feature_map = slim.conv2d(net, location_feature_map_depth,
                                         [1, 1], activation_fn=None,
                                         scope='refined_locations')
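Position-sensitive RoI pooling is then performed in much the same way as in the Faster R-CNN code. The position-sensitive score maps do look like a kind of "prior structure". Papers often feature this sort of bold, end-to-end-trained prior structure. The paper says: "With end-to-end training, this RoI layer shepherds the last convolutional layer to learn specialized position-sensitive score maps". The visualizations do seem to show this effect, which is interesting.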
'''
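The channel arithmetic in the snippet above can be checked with a short sketch. The helper names and the example values (3x3 bins, 20 classes, box code size 4, a 16-channel input) are hypothetical; the class-branch depth assumes one extra background class, matching the paper's k*k*(C+1).

```python
import numpy as np


def score_map_depths(num_spatial_bins, num_classes, box_code_size):
    """Channel counts of the two 1x1 conv outputs that feed PS RoI pooling."""
    bins = num_spatial_bins[0] * num_spatial_bins[1]
    location_depth = bins * num_classes * box_code_size
    # Assumption: the class branch adds one background class, k*k*(C+1).
    class_depth = bins * (num_classes + 1)
    return location_depth, class_depth


def conv1x1(features, weights):
    """A 1x1 convolution without bias or activation is a per-pixel matmul:
    (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)."""
    return features @ weights


location_depth, class_depth = score_map_depths((3, 3), num_classes=20,
                                               box_code_size=4)
features = np.random.default_rng(0).normal(size=(8, 8, 16))
weights = np.zeros((16, location_depth))  # placeholder weights, illustration only
location_feature_map = conv1x1(features, weights)
print(location_depth, class_depth, location_feature_map.shape)
```

The 1x1 convolution leaves the spatial size untouched and only changes the channel count, which is why the score maps can still be cropped with the same proposal boxes.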