Faster R-CNN Study Notes

三分明月落

Link to this article: https://blog.youkuaiyun.com/qq_40755643/article/details/89243690

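Evolution of the algorithms:

R-CNN: each of the roughly 2000 region proposals is fed through the network separately.

Fast R-CNN: the whole image is fed through the network once, and the 2000 region proposals are then mapped onto the feature map the network produces.

Faster R-CNN: an RPN selects about 300 region proposals, which are then passed to the ROI pooling layer. ROI pooling, which first appeared in Fast R-CNN, significantly speeds up training and testing and improves detection accuracy.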

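Architecture:

Pipeline:

Input image (224x224x3)
The whole image is fed into the CNN. Faster R-CNN first uses a stack of basic conv+relu+pooling layers to extract the feature maps of the input image; these feature maps are shared by the subsequent RPN and the fully connected layers.
RPN: the RPN places anchor boxes at every feature map location. One branch reshapes the scores and uses softmax to decide whether each anchor is foreground or background, that is, object or not object, so it is a binary classification. The other branch applies bounding box regression to refine the anchor boxes. After clipping, filtering and NMS (which relies on IoU), about 300 reasonably accurate proposals remain per image.
RoI pooling layer: maps the proposals generated by the RPN onto the feature map from the last conv layer of VGG16 and produces fixed-size proposal feature maps, which the fully connected layers that follow can use for recognition and localization.
Classifier: passes the fixed-size feature maps from the RoI pooling layer through fully connected layers, uses Softmax Loss to classify the specific category, and uses SmoothL1 Loss for bounding box regression to obtain the precise object position.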

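Algorithm details:

1. Conv layers

Faster R-CNN accepts input images of any size. Before entering the network, each image is rescaled to a normalized size, for example so that the short side does not exceed 600 and the long side does not exceed 1000. We can assume M*N = 1000*600 (if an image is smaller than that, its borders can be padded with zeros, which shows up as a black border).

VGG16: consecutive convolution blocks of (2+2+3+3+3) layers

13 conv layers: kernel_size=3, pad=1, stride=1; these do not change the spatial size
13 relu layers: activation function, does not change the spatial size
4 pooling layers: kernel_size=2, stride=2; each pooling layer halves the width and height of its input

After the conv layers the image size becomes (M/16)*(N/16). Strictly, 1000/16 = 62.5 and 600/16 = 37.5; these notes round to 60*40 to keep the arithmetic simple, so the feature map is 60*40*512.

Note: VGG16 is 512-d and ZF is 256-d; here the feature map has a spatial size of 60*40 and a depth of 512.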
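The size arithmetic above can be checked with a small sketch. The helper names are hypothetical, and rounding down at each pooling layer is an assumption (some frameworks round up instead), so the exact result for a 1000*600 input differs slightly from the rounded 60*40 used in these notes.

```python
def conv_output_size(size, kernel, stride, pad):
    """Spatial size after one conv or pooling layer, rounding down (assumption)."""
    return (size + 2 * pad - kernel) // stride + 1


def vgg16_feature_size(size):
    """Follow one spatial dimension through the (2+2+3+3+3) conv blocks."""
    convs_per_block = [2, 2, 3, 3, 3]
    for block, n_convs in enumerate(convs_per_block):
        for _ in range(n_convs):
            # 3x3 conv, pad 1, stride 1: the size is unchanged
            size = conv_output_size(size, kernel=3, stride=1, pad=1)
        if block < 4:
            # only the first four blocks are followed by a 2x2, stride-2 pooling
            size = conv_output_size(size, kernel=2, stride=2, pad=0)
    return size


print(vgg16_feature_size(1000), vgg16_feature_size(600))
print(vgg16_feature_size(960), vgg16_feature_size(640))
```

An input of 960*640 gives exactly 60*40, which is the size the rest of these notes work with.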

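2. RPN

RPN: anchor box generation

After the conv layers the image is 1/16 of its original size, so let feat_stride=16. To generate anchors we first define a base_anchor, a 16*16 box (one point on the 60*40 feature map corresponds to a 16*16 region of the 1000*600 image). The source code represents it as the array [0,0,15,15], with parameters ratios=[0.5, 1, 2] and scales=[8, 16, 32].

Starting from [0,0,15,15] and keeping the area roughly constant, the aspect ratios [0.5, 1, 2] produce three anchor boxes.

Applying the scales multiplies both sides of a box; for the square box this gives sides of 16*8=128, 16*16=256 and 16*32=512.

Combining the two transforms (the three aspect ratios 1:1, 1:2, 2:1 with the three scales) gives 9 anchor boxes in total.

The feature map is 60*40, so 60*40*9=21600 anchor boxes are generated in total.

Walkthrough of the generate_anchors source:

Start with the main block: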

if __name__ == '__main__':
    import time
    t = time.time()
    a = generate_anchors()   # this call does the main work
    print(time.time() - t)
    print(a)
    from IPython import embed; embed()

Now step into generate_anchors:

import numpy as np

def generate_anchors(base_size=16, ratios=[0.5, 1, 2],
                     scales=2**np.arange(3, 6)):
    """
    Generate anchor (reference) windows by enumerating aspect ratios X
    scales wrt a reference (0, 0, 15, 15) window.
    """

    base_anchor = np.array([1, 1, base_size, base_size]) - 1
    print("base anchor", base_anchor)
    ratio_anchors = _ratio_enum(base_anchor, ratios)
    print("anchors after ratio", ratio_anchors)
    anchors = np.vstack([_scale_enum(ratio_anchors[i, :], scales)
                         for i in range(ratio_anchors.shape[0])])
    print("anchors after ratio and scale", anchors)
    return anchors

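There are three parameters:

1. base_size=16

This parameter sets the size of the initial, receptive-field-like region. After several conv and pooling layers, one point on the feature map corresponds to a region of the original image; with the value 16 used here, each feature map point maps to a 16x16 region of the original image. It can be changed as needed.

2. ratios=[0.5,1,2]

This parameter reshapes the 16x16 region to three aspect ratios: 1:2, 1:1 and 2:1.

3. scales=2**np.arange(3, 6)

This parameter enlarges the width and height of the input region by three factors, 2^3=8, 2^4=16 and 2^5=32. For example, the 16x16 region becomes (16*8)*(16*8)=128*128, (16*16)*(16*16)=256*256 and (16*32)*(16*32)=512*512.

Now look at the first statement: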

base_anchor = np.array([1, 1, base_size, base_size]) - 1
 
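# base_anchor is [0, 0, 15, 15]

This is the basic 16x16 region; the four values are the coordinates of its top-left and bottom-right corners.

ratio_anchors = _ratio_enum(base_anchor, ratios)

This statement applies the ratio transform to the 16x16 region, producing anchors with three aspect ratios. It calls _ratio_enum, defined as follows:

Its inputs are one anchor (given by four coordinates) and the three aspect ratios (0.5, 1, 2).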

def _ratio_enum(anchor, ratios):
    """
    Enumerate a set of anchors for each aspect ratio wrt an anchor.
    """
    w, h, x_ctr, y_ctr = _whctrs(anchor)  # (0, 0, 15, 15) -> 16, 16, 7.5, 7.5
    size = w * h   # size: 16*16=256
    size_ratios = size / ratios  # 256/[0.5,1,2] = [512,256,128]
    # np.round rounds to the nearest integer, np.sqrt takes the square root
    ws = np.round(np.sqrt(size_ratios)) # ws: [23 16 11]
    hs = np.round(ws * ratios)    # hs: [12 16 22]; ws and hs pair up, e.g. 23 with 12
    # Given the width and height vectors, build the windows: convert from
    # (width, height, x center, y center) to the four-corner form
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors

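_ratio_enum also calls _whctrs (its definition is listed further below), which converts the four corner coordinates of the input anchor into the form (width, height, x center, y center).

The anchors it returns for the three aspect ratios are: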

ratio_anchors = _ratio_enum(base_anchor, ratios)
'''[[ -3.5,   2. ,  18.5,  13. ],
    [  0. ,   0. ,  15. ,  15. ],
    [  2.5,  -3. ,  12.5,  18. ]]'''

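After the aspect-ratio transform comes the scale transform:

anchors = np.vstack([_scale_enum(ratio_anchors[i, :], scales)
                     for i in range(ratio_anchors.shape[0])])

The key piece here is _scale_enum, defined below. Each of the three aspect-ratio anchors in ratio_anchors from the previous step is scaled by the three scales, so three aspect ratios combined with three scales give 9 anchors. These are the 9 anchors per location described in the paper.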

def _scale_enum(anchor, scales):
    """
    Enumerate a set of anchors for each scale wrt an anchor.
    """
 
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    ws = w * scales
    hs = h * scales
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors

def _whctrs(anchor):
    """
    Return width, height, x center, and y center for an anchor (window).
    """
    w = anchor[2] - anchor[0] + 1
    h = anchor[3] - anchor[1] + 1
    x_ctr = anchor[0] + 0.5 * (w - 1)
    y_ctr = anchor[1] + 0.5 * (h - 1)
    return w, h, x_ctr, y_ctr

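_whctrs converts the original anchor coordinates (0, 0, 15, 15) into w=16, h=16, x_ctr=7.5, y_ctr=7.5.

_scale_enum likewise first converts each ratio_anchor into the form (width, height, x center, y center), multiplies both the width and the height by each scale, and converts the result back to four corner coordinates. The coordinates of the 9 anchors after both the ratio and scale transforms are:

anchors = np.vstack([_scale_enum(ratio_anchors[i, :], scales)
                     for i in range(ratio_anchors.shape[0])])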
'''
[[ -84.  -40.   99.   55.]
 [-176.  -88.  191.  103.]
 [-360. -184.  375.  199.]
 [ -56.  -56.   71.   71.]
 [-120. -120.  135.  135.]
 [-248. -248.  263.  263.]
 [ -36.  -80.   51.   95.]
 [ -80. -168.   95.  183.]
 [-168. -344.  183.  359.]]
'''
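The walkthrough calls _mkanchors without listing it. The following is a minimal sketch of what such a helper has to do, written to be consistent with the coordinates printed in these notes; it is not guaranteed to match the original source line for line.

```python
import numpy as np


def _mkanchors(ws, hs, x_ctr, y_ctr):
    """Turn widths and heights around a shared center into (x1, y1, x2, y2) rows."""
    ws = np.asarray(ws, dtype=float)[:, np.newaxis]
    hs = np.asarray(hs, dtype=float)[:, np.newaxis]
    return np.hstack((x_ctr - 0.5 * (ws - 1),
                      y_ctr - 0.5 * (hs - 1),
                      x_ctr + 0.5 * (ws - 1),
                      y_ctr + 0.5 * (hs - 1)))


# Ratio step: widths/heights [23, 16, 11] / [12, 16, 22] around center (7.5, 7.5)
ratio_anchors = _mkanchors([23, 16, 11], [12, 16, 22], 7.5, 7.5)
print(ratio_anchors)

# Scale step for the first ratio anchor (w=23, h=12) with scales 8, 16, 32
scaled = _mkanchors(23 * np.array([8, 16, 32]), 12 * np.array([8, 16, 32]), 7.5, 7.5)
print(scaled)
```

The first call reproduces the three ratio anchors listed earlier, and the second reproduces the first three rows of the 9-anchor table.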

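In the source code, a shift array is built from the width offsets (0~60)*16 and the height offsets (0~40)*16 and added to the base anchor coordinates. This yields the anchor coordinates for every pixel of the feature map: a [21600, 4] array (60*40*9 = 21600).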
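The shifting step can be sketched with NumPy broadcasting. The function name shift_anchors is hypothetical, and the 9 base anchors are copied from the table in these notes.

```python
import numpy as np


def shift_anchors(base_anchors, feat_h, feat_w, feat_stride=16):
    """Replicate the base anchors at every feature map cell."""
    shift_x = np.arange(feat_w) * feat_stride
    shift_y = np.arange(feat_h) * feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    # One (dx, dy, dx, dy) row per cell, in row-major order: shape (K, 4)
    shifts = np.stack([shift_x.ravel(), shift_y.ravel(),
                       shift_x.ravel(), shift_y.ravel()], axis=1)
    num_anchors = base_anchors.shape[0]
    num_cells = shifts.shape[0]
    # (1, A, 4) + (K, 1, 4) broadcasts to (K, A, 4)
    shifted = base_anchors.reshape(1, num_anchors, 4) + shifts.reshape(num_cells, 1, 4)
    return shifted.reshape(num_cells * num_anchors, 4)


base_anchors = np.array([
    [-84., -40., 99., 55.], [-176., -88., 191., 103.], [-360., -184., 375., 199.],
    [-56., -56., 71., 71.], [-120., -120., 135., 135.], [-248., -248., 263., 263.],
    [-36., -80., 51., 95.], [-80., -168., 95., 183.], [-168., -344., 183., 359.],
])
all_anchors = shift_anchors(base_anchors, feat_h=40, feat_w=60)
print(all_anchors.shape)
```

Rows 0 to 8 are the base anchors at cell (0, 0); rows 9 to 17 are the same anchors moved 16 pixels to the right.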

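RPN: classification and regression

When the feature map enters the RPN it first goes through a 3*3 convolution, after which the feature map is still 60*40*512; the purpose is presumably to aggregate the feature information further. Two 1*1 convolutions follow (kernel_size=1*1, pad=0, stride=1). The role of the 1*1 convolutions here is to change the channel dimension (to 18 and 36).

As labelled in the network figure:

① rpn_cls: 60*40*512-d ⊕ 1*1*512*18 ==> 60*40*9*2

Binary classification of the 9 anchor boxes at each pixel

② rpn_bbox: 60*40*512-d ⊕ 1*1*512*36 ==> 60*40*9*4

Four coordinate values (in fact offsets) for the 9 anchor boxes at each pixel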
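A 1*1 convolution is a matrix multiplication applied to the channel vector at every pixel, so the shape bookkeeping above can be sketched as follows. The weights are random placeholders, and the way the 18 and 36 channels are grouped per anchor is an illustrative assumption that may differ from the channel layout of the Caffe model.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, A = 40, 60, 512, 9          # feature map height, width, depth, anchors per pixel
features = rng.standard_normal((H, W, C)).astype(np.float32)

# Placeholder 1x1 conv weights: one (C, out_channels) matrix per branch
w_cls = rng.standard_normal((C, 2 * A)).astype(np.float32)
w_bbox = rng.standard_normal((C, 4 * A)).astype(np.float32)

rpn_cls_score = features @ w_cls     # (40, 60, 18)
rpn_bbox_pred = features @ w_bbox    # (40, 60, 36)

# View the channel axis as "9 anchors x 2 scores" and "9 anchors x 4 offsets"
cls_per_anchor = rpn_cls_score.reshape(H, W, A, 2)
bbox_per_anchor = rpn_bbox_pred.reshape(H, W, A, 4)
print(cls_per_anchor.shape, bbox_per_anchor.shape)
```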

RPN: how it works

The network in the Caffe version:

<<1>>rpn-data：

Generates the anchor boxes, filters and labels them (for softmax), and computes the offsets between the anchor boxes and the ground truth (for SmoothL1)

layer {
  name: 'rpn-data'
  type: 'Python'
  bottom: 'rpn_cls_score'   # only supplies the feature map height and width
  bottom: 'gt_boxes'        # ground truth boxes
  bottom: 'im_info'         # image size and scale factor, used to filter anchor boxes
  bottom: 'data'
  top: 'rpn_labels'
  top: 'rpn_bbox_targets'
  top: 'rpn_bbox_inside_weights'
  top: 'rpn_bbox_outside_weights'
  python_param {
    module: 'rpn.anchor_target_layer'
    layer: 'AnchorTargetLayer'
    param_str: "'feat_stride': 16 \n'scales': !!python/tuple [8, 16, 32]"
  }
}

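This layer generates 9 anchor boxes (positions) for every pixel of the 60*40 feature map, then filters and labels them; see the source code.

The filtering and labelling rules are:

Discard anchor boxes that cross the boundary of the 1000*600 image

If an anchor box has IoU > 0.7 with a ground truth box, mark it as a positive sample, label=1

If an anchor box has IoU < 0.3 with every ground truth box, mark it as a negative sample, label=0

The rest are neither positive nor negative and are not used in training, label=-1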
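The rules above can be sketched as follows. The function names are hypothetical, the IoU uses the same inclusive pixel convention (+1) as the anchor code in these notes, and out-of-bounds anchors are left at label -1 rather than physically removed, which is a simplification.

```python
import numpy as np


def iou_matrix(boxes, gt):
    """Pairwise IoU between boxes (N, 4) and gt (M, 4), both as x1, y1, x2, y2."""
    x1 = np.maximum(boxes[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1 + 1, 0, None) * np.clip(y2 - y1 + 1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1)
    area_g = (gt[:, 2] - gt[:, 0] + 1) * (gt[:, 3] - gt[:, 1] + 1)
    return inter / (area_b[:, None] + area_g[None, :] - inter)


def label_anchors(anchors, gt, im_w=1000, im_h=600, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for each anchor."""
    labels = np.full(len(anchors), -1, dtype=np.int64)
    inside = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
              (anchors[:, 2] < im_w) & (anchors[:, 3] < im_h))
    max_iou = iou_matrix(anchors, gt).max(axis=1)   # best overlap per anchor
    labels[inside & (max_iou < neg_thr)] = 0
    labels[inside & (max_iou > pos_thr)] = 1
    return labels


anchors = np.array([[100., 100., 199., 199.],   # identical to the ground truth box
                    [110., 110., 209., 209.],   # IoU about 0.68: neither positive nor negative
                    [500., 300., 599., 399.],   # no overlap
                    [-20., -20., 79., 79.]])    # crosses the image boundary
gt_boxes = np.array([[100., 100., 199., 199.]])
labels = label_anchors(anchors, gt_boxes)
print(labels)
```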

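Besides labelling the anchor boxes, the other job is to compute the offsets between the anchor boxes and the ground truth

Let: ground truth: the annotated box has center coordinates x*, y* and width and height w*, h*

anchor box: center coordinates x_a, y_a and width and height w_a, h_a

The offsets are then:

Δx=(x*-x_a)/w_a Δy=(y*-y_a)/h_a

Δw=log(w*/w_a) Δh=log(h*/h_a)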
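The offset formulas translate directly into code. This is a sketch with hypothetical function names; the center is computed with the same convention as _whctrs in these notes, which is an assumption about how the target layer defines it.

```python
import numpy as np


def to_center_form(boxes):
    """(x1, y1, x2, y2) rows -> widths, heights, x centers, y centers."""
    w = boxes[:, 2] - boxes[:, 0] + 1.0
    h = boxes[:, 3] - boxes[:, 1] + 1.0
    x = boxes[:, 0] + 0.5 * (w - 1)
    y = boxes[:, 1] + 0.5 * (h - 1)
    return w, h, x, y


def bbox_targets(anchors, gt):
    """Offsets (dx, dy, dw, dh) that move each anchor onto its ground truth box."""
    wa, ha, xa, ya = to_center_form(anchors)
    wg, hg, xg, yg = to_center_form(gt)
    dx = (xg - xa) / wa
    dy = (yg - ya) / ha
    dw = np.log(wg / wa)
    dh = np.log(hg / ha)
    return np.stack([dx, dy, dw, dh], axis=1)


anchors = np.array([[0., 0., 15., 15.]])
gt_boxes = np.array([[0., 0., 31., 31.]])
targets = bbox_targets(anchors, gt_boxes)
print(targets)
```

For a 16x16 anchor and a 32x32 ground truth box sharing the same top-left corner, the center moves by half an anchor width in each direction and both sides double, so the targets are (0.5, 0.5, log 2, log 2).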

The network learns from the difference between the ground truth box and the anchor box, so that the RPN weights acquire the ability to predict boxes

<<2>>rpn_loss_cls、rpn_loss_bbox、rpn_cls_prob：

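'rpn_loss_cls' and 'rpn_loss_bbox' compute the softmax loss and the smooth L1 loss respectively.

Softmax formula, which gives the probability of each class: p_i = exp(z_i) / Σ_j exp(z_j)
Softmax Loss formula: L = -log(p_k), where k is the true class; training the RPN classifier means minimizing this loss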
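Both losses are short enough to write out. This is a sketch with hypothetical function names, using the standard smooth L1 definition (0.5*x^2 for |x| < 1, |x| - 0.5 otherwise); the Caffe layers may scale or weight these terms differently.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def softmax_loss(scores, labels):
    """Mean of -log(p_true) over samples with label >= 0; label -1 is ignored."""
    keep = labels >= 0
    probs = softmax(scores[keep])
    picked = probs[np.arange(int(keep.sum())), labels[keep]]
    return -np.log(picked).mean()


def smooth_l1(x):
    """Elementwise smooth L1: quadratic near zero, linear in the tails."""
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * abs_x ** 2, abs_x - 0.5)


scores = np.array([[0.0, 0.0], [5.0, -5.0]])
labels = np.array([1, -1])            # the second anchor is ignored
cls_loss = softmax_loss(scores, labels)
reg_loss = smooth_l1(np.array([0.5, -2.0]))
print(cls_loss, reg_loss)
```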

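RPN training setup: when training the RPN, one mini-batch consists of 256 anchors sampled at random from a single image, with positives and negatives in a 1:1 ratio. If there are fewer than 128 positives, extra negatives fill the batch up to 256, and vice versa.

An example: suppose 5000 anchors are available for training, 200 positive and 4800 negative. Each mini-batch randomly draws 128 from the positives and 128 from the negatives. Why? Negatives usually make up the vast majority; with purely random sampling they would dominate every mini-batch and bias the loss toward the negatives.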
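The sampling rule can be sketched by switching surplus samples to the ignore label -1. The function name and the seed parameter are hypothetical; this sketch covers the case described in the example, where negatives are plentiful.

```python
import numpy as np


def sample_minibatch(labels, batch_size=256, fg_fraction=0.5, seed=0):
    """Keep at most batch_size labelled anchors; surplus ones are set to -1."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    max_fg = int(batch_size * fg_fraction)

    fg = np.flatnonzero(labels == 1)
    if len(fg) > max_fg:
        drop = rng.choice(fg, size=len(fg) - max_fg, replace=False)
        labels[drop] = -1

    # Negatives fill whatever room the positives left in the batch
    max_bg = batch_size - int((labels == 1).sum())
    bg = np.flatnonzero(labels == 0)
    if len(bg) > max_bg:
        drop = rng.choice(bg, size=len(bg) - max_bg, replace=False)
        labels[drop] = -1
    return labels


labels = np.concatenate([np.ones(200), np.zeros(4800)]).astype(np.int64)
sampled = sample_minibatch(labels)
print((sampled == 1).sum(), (sampled == 0).sum())

few_pos = np.concatenate([np.ones(50), np.zeros(4950)]).astype(np.int64)
sampled_few = sample_minibatch(few_pos)
print((sampled_few == 1).sum(), (sampled_few == 0).sum())
```

With 200 positives the batch is 128 positives and 128 negatives; with only 50 positives, 206 negatives fill the remaining slots.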

rpn_cls_prob stores the class score of each box. It is needed for the NMS step that follows, since NMS sorts boxes by confidence (class score).

<<3>>proposal

layer {
  name: 'proposal'
  type: 'Python'
  bottom: 'rpn_cls_prob_reshape'  # [1,18,40,60] ==> [batch_size, channel, height, width], the Caffe data layout; anchor box class probabilities
  bottom: 'rpn_bbox_pred'  # the four predicted regression values Δx, Δy, Δw, Δh
  bottom: 'im_info'
  top: 'rpn_rois'
  python_param {
    module: 'rpn.proposal_layer'
    layer: 'ProposalLayer'
    param_str: "'feat_stride': 16 \n'scales': !!python/tuple [4, 8, 16, 32]"
  }
}

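Among the inputs is 'rpn_bbox_pred', which holds the four predicted regression values Δx, Δy, Δw, Δh.

In the source code, the 60*40*9 anchor boxes are generated again and the predicted Δx, Δy, Δw, Δh are applied to them, giving predicted boxes (region proposals) that are more accurate than the raw anchors.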
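Applying the predicted offsets is the inverse of the target computation: the center moves by Δx*w_a and Δy*h_a, and the sides are multiplied by exp(Δw) and exp(Δh). The sketch below uses a hypothetical function name and the same center convention as _whctrs in these notes.

```python
import numpy as np


def apply_deltas(anchors, deltas):
    """Decode (dx, dy, dw, dh) rows against anchors given as (x1, y1, x2, y2)."""
    w = anchors[:, 2] - anchors[:, 0] + 1.0
    h = anchors[:, 3] - anchors[:, 1] + 1.0
    x = anchors[:, 0] + 0.5 * (w - 1)
    y = anchors[:, 1] + 0.5 * (h - 1)

    pred_x = deltas[:, 0] * w + x       # shift the center
    pred_y = deltas[:, 1] * h + y
    pred_w = np.exp(deltas[:, 2]) * w   # rescale the sides
    pred_h = np.exp(deltas[:, 3]) * h

    return np.stack([pred_x - 0.5 * (pred_w - 1), pred_y - 0.5 * (pred_h - 1),
                     pred_x + 0.5 * (pred_w - 1), pred_y + 0.5 * (pred_h - 1)], axis=1)


anchors = np.array([[0., 0., 15., 15.]])
deltas = np.array([[0.5, 0.5, np.log(2.0), np.log(2.0)]])
proposals = apply_deltas(anchors, deltas)
print(proposals)
```

Here the 16x16 anchor is decoded into the 32x32 box (0, 0, 31, 31).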

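The predicted boxes are then checked against the image boundary and passed through NMS (non-maximum suppression), which removes overlapping boxes;

For example, with an IoU threshold of 0.7, only boxes that are local score maxima and overlap an already kept box by no more than 0.7 are retained (a coarse filter). About 2000 boxes remain, and the top N of them (for example 300) are kept, so only about 300 region proposals enter the next layer, ROI Pooling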
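A minimal sketch of greedy NMS with the 0.7 threshold and a cap of N kept boxes. The function name and signature are hypothetical, and the inclusive pixel convention (+1) matches the anchor code in these notes.

```python
import numpy as np


def nms(boxes, scores, iou_thr=0.7, top_n=300):
    """Greedy NMS: repeatedly keep the best box and drop boxes overlapping it too much."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = np.argsort(-scores)          # indices sorted by descending score
    keep = []
    while order.size > 0 and len(keep) < top_n:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        inter_w = np.maximum(0.0, np.minimum(x2[best], x2[rest]) - np.maximum(x1[best], x1[rest]) + 1)
        inter_h = np.maximum(0.0, np.minimum(y2[best], y2[rest]) - np.maximum(y1[best], y1[rest]) + 1)
        inter = inter_w * inter_h
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_thr]     # survivors overlap the kept box by at most iou_thr
    return keep


boxes = np.array([[0., 0., 99., 99.],
                  [5., 5., 104., 104.],      # IoU with the first box is about 0.82
                  [200., 200., 299., 299.]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the highest-scoring box by more than 0.7 and is suppressed, so boxes 0 and 2 remain.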

<<4>>roi_data

layer {
  name: 'roi-data'
  type: 'Python'
  bottom: 'rpn_rois'
  bottom: 'gt_boxes'
  top: 'rois'
  top: 'labels'
  top: 'bbox_targets'
  top: 'bbox_inside_weights'
  top: 'bbox_outside_weights'
  python_param {
    module: 'rpn.proposal_target_layer'
    layer: 'ProposalTargetLayer'
    param_str: "'num_classes': 81"
  }
}

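To avoid confusion, we call the predicted boxes that come out of 'proposal' region proposals (at this point the RPN has done its job; roi_data prepares the data for the next layer)

Main roles:

The RPN only decides whether a region proposal is an object (yes/no). roi_data assigns each region proposal a specific class label based on its maximum overlap with the ground truth boxes (no longer a binary problem; the parameter specifies 81 classes), labelling it a second time.
Compute the offsets between the region proposals and the ground truth boxes, using the same formula as before

The output of this step is fed to the ROI Pooling layer for further classification and localization.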

3. ROI Pooling:

layer {
  name: "roi_pool5"
  type: "ROIPooling"
  bottom: "conv5_3"   # input feature map
  bottom: "rois"      # input region proposals
  top: "pool5"        # output fixed-size feature map
  roi_pooling_param {
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625 # 1/16
  }
}

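Maps the proposals onto the last feature map of the CNN and pools them into fixed-size proposal feature maps

As the Caffe code above shows, the inputs are the region proposals produced by the RPN (assume 300 region proposal boxes) and the feature map produced by the last layer of VGG16 (60*40, 512-d). For each region proposal, its coordinates are divided by 16, which maps a region proposal defined on the original image (1000*600) onto the 60*40 feature map and so determines a region on the feature map (call it RB*). According to pooled_w: 7 and pooled_h: 7, RB* is divided into 7*7, that is 49, sub-regions of roughly equal size; max pooling takes the largest value in each sub-region as the output, which gives a 7*7 feature map.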
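The procedure just described can be sketched for a single region proposal. The function name is hypothetical, and the bin boundaries (floor for the start, ceil for the end, so neighbouring bins may share a row or column) are an assumption about how the sub-regions are cut; it is not the Caffe implementation.

```python
import numpy as np


def roi_pool(feature, roi, pooled_h=7, pooled_w=7, spatial_scale=1.0 / 16):
    """Max-pool one RoI, given in image coordinates, to a fixed pooled_h x pooled_w grid."""
    feat_h, feat_w, channels = feature.shape
    # Map image coordinates onto the feature map (divide by 16 and round)
    x1, y1, x2, y2 = [int(np.floor(c * spatial_scale + 0.5)) for c in roi]
    roi_w = max(x2 - x1 + 1, 1)
    roi_h = max(y2 - y1 + 1, 1)

    out = np.zeros((pooled_h, pooled_w, channels), dtype=feature.dtype)
    for ph in range(pooled_h):
        h0 = min(max(y1 + int(np.floor(ph * roi_h / pooled_h)), 0), feat_h)
        h1 = min(max(y1 + int(np.ceil((ph + 1) * roi_h / pooled_h)), 0), feat_h)
        for pw in range(pooled_w):
            w0 = min(max(x1 + int(np.floor(pw * roi_w / pooled_w)), 0), feat_w)
            w1 = min(max(x1 + int(np.ceil((pw + 1) * roi_w / pooled_w)), 0), feat_w)
            if h1 > h0 and w1 > w0:
                # Largest value in this sub-region, per channel
                out[ph, pw] = feature[h0:h1, w0:w1].max(axis=(0, 1))
    return out


# Toy 40x60 feature map with one channel; the value at (row, col) is row*60 + col
feature = np.arange(40 * 60, dtype=np.float32).reshape(40, 60, 1)
pooled = roi_pool(feature, roi=(0, 0, 223, 223))
print(pooled.shape)
```

Whatever the size of the region proposal, the output is always 7*7 per channel, which is what lets the fully connected layers that follow accept it.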

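The ROI pooling layer significantly speeds up training and testing and improves detection accuracy.

The layer has two inputs:

The feature map obtained from a deep network with several conv and pooling layers, here the feature map produced by the last layer of VGG16 (60*40, 512-d)
The region proposals produced by the RPN (assume 300 region proposal boxes), given as an N*5 matrix over all ROIs, where N (300) is the number of ROIs. The first column is the image index and the other four columns are the top-left and bottom-right coordinates;

For the details of the ROI pooling operation, see: https://blog.youkuaiyun.com/qq_38906523/article/details/80190807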