YOLO series: YOLOv2_boxwh (优快云 blog)

Article link: https://blog.youkuaiyun.com/mayeight/article/details/117668198

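This article summarizes the key improvements in the YOLOv2 algorithm: the Darknet network structure, fusing fine-grained features into the 13x13 feature map, the anchor matching strategy, and how an MSE loss relates predicted boxes to ground-truth boxes. It also covers the introduction of BN layers, multi-scale training, and the effect of anchor clustering on model performance.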


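The previous post recorded some key points of the YOLOv1 algorithm. This post goes through YOLOv2 from the same angles: network, anchor matching, post-processing, and training.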

1. Network structure

One major innovation of YOLOv2 is the Darknet network structure (Darknet-19 in the paper). Its overall structure is shown in the figure below.

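In Darknet, the author makes heavy use of 1*1 convolution kernels to compress channels. Starting from the third convolution, each regular convolution doubles the number of channels, so it is followed by a 1*1 convolution with half as many filters. This reduces the channel dimension and fuses information across channels.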
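
To make the channel-reduction step concrete, here is a minimal NumPy sketch of a 1*1 convolution viewed as the same linear map applied at every pixel. The function name `conv1x1`, the tensor sizes, and the random weights are assumptions for illustration only; they are not taken from the Darknet code.

```python
import numpy as np


def conv1x1(x, weights):
    """Apply a 1x1 convolution to x of shape (H, W, C_in) with weights (C_in, C_out)."""
    # A 1x1 convolution mixes channels independently at every spatial position.
    return np.einsum("hwc,cd->hwd", x, weights)


rng = np.random.default_rng(0)
features = rng.standard_normal((13, 13, 1024))  # hypothetical output of a regular convolution
weights = rng.standard_normal((1024, 512))      # half as many filters as input channels
reduced = conv1x1(features, weights)
print(reduced.shape)
```

Each output pixel is just the input pixel's channel vector multiplied by the weight matrix, which is why this layer is cheap and why it mixes information across channels.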

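The overall network structure is shown in the figure below. The layers before layer 22 are the Darknet backbone; the layers after it are the added detection network.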

Note from the reposted source: the route layer merges (concatenates) layers.

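A passthrough layer is added directly at layer 27 to obtain fine-grained 26*26 features: the 26*26*512 feature map is rearranged into 13*13*2048 (adjacent spatial positions are stacked into channels) and then concatenated with the original 13*13*1024 feature map, which brings in multi-scale information. This improves mAP by 1%.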
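
The rearrangement described above can be sketched in NumPy as a space-to-depth operation followed by a channel concatenation. This is only an illustration under assumptions: the function name `passthrough` is hypothetical, and the channel ordering chosen here is one possible ordering that may differ from the one Darknet's own layer uses.

```python
import numpy as np


def passthrough(x, stride=2):
    """Rearrange (H, W, C) into (H/stride, W/stride, C*stride*stride)."""
    h, w, c = x.shape
    # Split each spatial axis into (block index, position inside the block).
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    # Move the two in-block axes next to the channel axis.
    x = x.transpose(0, 2, 1, 3, 4)
    # Fold the in-block positions into the channel dimension.
    return x.reshape(h // stride, w // stride, stride * stride * c)


fine = np.zeros((26, 26, 512), dtype=np.float32)     # fine-grained feature map
coarse = np.zeros((13, 13, 1024), dtype=np.float32)  # original 13*13 feature map
merged = np.concatenate([passthrough(fine), coarse], axis=-1)
print(merged.shape)
```

With the sizes quoted in the text, the merged tensor has 2048 + 1024 = 3072 channels at 13*13 resolution.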

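The output of layer 30 is 13*13, meaning that convolution and pooling shrink the image down to a 13*13 grid. Each grid cell outputs 125 values: 5 anchors * (4 box coordinates + 1 confidence + 20 classes). The final output therefore has 13*13*125 values.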
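
As a quick check of that arithmetic, the sketch below splits a dummy 13*13*125 output into per-anchor box values, confidence, and class scores, mirroring the reshape used in `yolo_head` further down. The variable names are illustrative assumptions.

```python
import numpy as np

num_anchors, num_classes = 5, 20
per_anchor = 4 + 1 + num_classes  # tx, ty, tw, th + confidence + class scores

# Dummy network output for one image.
output = np.zeros((13, 13, num_anchors * per_anchor), dtype=np.float32)

# One row of 25 values per anchor in every grid cell.
per_box = output.reshape(13, 13, num_anchors, per_anchor)
box_t = per_box[..., :4]
objectness = per_box[..., 4:5]
class_scores = per_box[..., 5:]
print(output.shape[-1], box_t.shape, objectness.shape, class_scores.shape)
```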

2. Anchor matching

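For anchor matching, the first question is what the raw output of the network is. The raw output represents what the network has learned, and converting it into actual box positions is the decoding step (for details, see this article). Reading the code: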

from keras import backend as K


def yolo_head(feats, anchors, num_classes):
    """Convert final layer features to bounding box parameters.

    Parameters
    ----------
    feats : output of the network's last layer, shape: [-1, 13, 13, 125]

    anchors : the anchor box sizes; the paper uses five. Two dimensions:
        dim 1: number of anchor boxes, 5 here
        dim 2: [w, h], both measured in units of grid cell width/height
        [1.08, 1.19], [3.42, 4.41], [6.63, 11.38], [9.42, 5.11], [16.62, 10.52]

    num_classes : number of classes

    Returns
    -------
    box_xy : center x, y of every pred_box in every grid cell of every image.
        Computed below as (sigmoid offset + cell index) / grid size, so the
        values are normalized over the whole image: the top-left corner is
        [0, 0] and the bottom-right corner is [1, 1].
        Five dimensions, shape: [-1, 13, 13, 5, 2].
        dim 1: image index in the batch
        dim 2: row of the grid cell (which cell along y)
        dim 3: column of the grid cell (which cell along x)
        dim 4: which anchor box
        dim 5: [x, y]

    box_wh : w, h of every pred_box, computed below as
        exp(output) * anchor / grid size, i.e. as a fraction of the image size.
        Five dimensions, shape: [-1, 13, 13, 5, 2]. Dimensions 1 to 4 are the
        same as for box_xy; dim 5 is [w, h].

    box_confidence : probability that each pred_box contains a detectable
        object. Five dimensions, shape: [-1, 13, 13, 5, 1]. Dimensions as above.

    box_class_probs : per-class probabilities (after softmax) for each
        pred_box. shape: [-1, 13, 13, 5, 20]

    """
    num_anchors = len(anchors)
    # Reshape to batch, height, width, num_anchors, box_params.
    anchors_tensor = K.reshape(K.variable(anchors), [1, 1, 1, num_anchors, 2])

    conv_dims = K.shape(feats)[1:3]  # number of grid cells along height and width; 13x13 here
    # In YOLO the height index is the inner most iteration.
    conv_height_index = K.arange(0, stop=conv_dims[0])
    conv_width_index = K.arange(0, stop=conv_dims[1])
    conv_height_index = K.tile(conv_height_index, [conv_dims[1]])

    conv_width_index = K.tile(
        K.expand_dims(conv_width_index, 0), [conv_dims[0], 1])
    conv_width_index = K.flatten(K.transpose(conv_width_index))
    conv_index = K.transpose(K.stack([conv_height_index, conv_width_index]))
    conv_index = K.reshape(conv_index, [1, conv_dims[0], conv_dims[1], 1, 2])  # shape: [1, 13, 13, 1, 2]
    conv_index = K.cast(conv_index, K.dtype(feats))

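    # tile(): repeat the tensor along each axis
    # expand_dims(): add a dimension
    # transpose(): transpose
    # flatten(): flatten to 1-D
    # stack(): stack tensors along a new dimension
    # conv_index (roughly, for a square grid):
    # [0,0],[1,0],...,[12,0],[0,1],[1,1],...,[12,12],
    # i.e. the [x, y] cell index of each grid cell, row by row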

    feats = K.reshape(
        feats, [-1, conv_dims[0], conv_dims[1], num_anchors, num_classes + 5])
    conv_dims = K.cast(K.reshape(conv_dims, [1, 1, 1, 1, 2]), K.dtype(feats))

    box_xy = K.sigmoid(feats[..., :2])
    box_wh = K.exp(feats[..., 2:4])
    box_confidence = K.sigmoid(feats[..., 4:5])
    box_class_probs = K.softmax(feats[..., 5:])

    # Adjust predictions to each spatial grid point and anchor size.
    # Note: YOLO iterates over height index before width index.
    box_xy = (box_xy + conv_index) / conv_dims
    box_wh = box_wh * anchors_tensor / conv_dims

    return box_xy, box_wh, box_confidence, box_class_probs

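import numpy as np


def preprocess_true_boxes(true_boxes, anchors, image_size):
    """
    Parameters
    --------------
    true_boxes : positions and classes of the ground-truth boxes (our input). Two dimensions:
        dim 1: number of ground-truth boxes in the image
        dim 2: [x, y, w, h, class]; x, y is the box center, w, h the box width and
               height. x, y, w, h are divided by the image resolution, giving
               ratios in [0, 1].

    anchors : the anchor box sizes (a NumPy array); the paper uses five. Two dimensions:
        dim 1: number of anchor boxes, 5 here
        dim 2: [w, h], both measured in units of grid cell width/height
               [1.08, 1.19], [3.42, 4.41], [6.63, 11.38], [9.42, 5.11], [16.62, 10.52]

    image_size : actual image size, 416x416 here.

    Returns
    --------------
    detectors_mask : values are 0 or 1; the shape here is [13, 13, 5, 1]. Four dimensions:
        dim 1: row of the grid cell containing the true box center (which cell along y)
        dim 2: column of the grid cell containing the true box center (which cell along x)
        dim 3: which anchor box
        dim 4: 0/1; 1 marks the anchor box responsible for predicting that true box

    matching_true_boxes : the shape here is [13, 13, 5, 5]. Four dimensions:
        dim 1: row of the grid cell containing the true box center (which cell along y)
        dim 2: column of the grid cell containing the true box center (which cell along x)
        dim 3: which anchor box
        dim 4: [x, y, w, h, class]. x, y are offsets relative to the grid cell;
               w, h are log(true size / anchor size); class is the class index.
               The code below shows the details.
    """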

    height, width = image_size
    num_anchors = len(anchors)

    assert height % 32 == 0, 'The input image height must be a multiple of 32.'
    assert width % 32 == 0, 'The input image width must be a multiple of 32.'

    conv_height = height // 32  # number of grid cells along the height
    conv_width = width // 32  # number of grid cells along the width

    num_box_params = true_boxes.shape[1] 
    detectors_mask = np.zeros(
        (conv_height, conv_width, num_anchors, 1), dtype=np.float32)
    matching_true_boxes = np.zeros(
        (conv_height, conv_width, num_anchors, num_box_params),
        dtype=np.float32)  # allocate detectors_mask and matching_true_boxes, filled with zeros

    for box in true_boxes:  # iterate over the ground-truth boxes
        box_class = box[4]  # class index of this box (a scalar, so the np.array below stays 1-D)

        box = box[0:4] * np.array(
            [conv_width, conv_height, conv_width, conv_height])  # convert to grid cell units

        i = np.floor(box[1]).astype('int')  # row: which grid cell along y
        j = np.floor(box[0]).astype('int')  # column: which grid cell along x
        best_iou = 0
        best_anchor = 0


        # Compute the IoU between each anchor box and the true box; keep the best-matching anchor.
        for k, anchor in enumerate(anchors):
            # Find IOU between box shifted to origin and anchor box.
            box_maxes = box[2:4] / 2.
            box_mins = -box_maxes
            anchor_maxes = (anchor / 2.)
            anchor_mins = -anchor_maxes

            intersect_mins = np.maximum(box_mins, anchor_mins)
            intersect_maxes = np.minimum(box_maxes, anchor_maxes)
            intersect_wh = np.maximum(intersect_maxes - intersect_mins, 0.)
            intersect_area = intersect_wh[0] * intersect_wh[1]
            box_area = box[2] * box[3]
            anchor_area = anchor[0] * anchor[1]
            iou = intersect_area / (box_area + anchor_area - intersect_area)
            if iou > best_iou:
                best_iou = iou
                best_anchor = k


        if best_iou > 0:
            detectors_mask[i, j, best_anchor] = 1  # mark the best-matching anchor box
            adjusted_box = np.array(
                [
                    # x, y: offset inside the grid cell, top-left [0,0], bottom-right [1,1]
                    box[0] - j, box[1] - i,
                    # w, h: log of the ratio between the true box size and the anchor size
                    np.log(box[2] / anchors[best_anchor][0]),
                    np.log(box[3] / anchors[best_anchor][1]),
                    box_class  # class index of the true box
                ],
                dtype=np.float32)
            matching_true_boxes[i, j, best_anchor] = adjusted_box   
    return detectors_mask, matching_true_boxes
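
The anchor selection inside `preprocess_true_boxes` can be isolated into a small vectorized sketch. Because both the true box and the anchor are centered at the origin, their overlap is simply the element-wise minimum of the two sizes. The helper name `best_anchor_for` and the example box are assumptions for illustration; the anchor values are the ones listed in the docstrings above.

```python
import numpy as np

ANCHORS = np.array([[1.08, 1.19], [3.42, 4.41], [6.63, 11.38],
                    [9.42, 5.11], [16.62, 10.52]])


def best_anchor_for(box_wh, anchors=ANCHORS):
    """Return (index of best anchor, IoU with every anchor) for a box size in grid units."""
    box_wh = np.asarray(box_wh, dtype=float)
    # Both boxes share the same center, so the intersection is min(w) * min(h).
    inter_wh = np.minimum(box_wh, anchors)
    inter = inter_wh[:, 0] * inter_wh[:, 1]
    union = box_wh[0] * box_wh[1] + anchors[:, 0] * anchors[:, 1] - inter
    iou = inter / union
    return int(np.argmax(iou)), iou


idx, ious = best_anchor_for([3.5, 4.5])
print(idx, ious)
```

A box of 3.5 x 4.5 grid cells is closest in shape to the second anchor (3.42 x 4.41), so only that anchor in the box's grid cell is marked in `detectors_mask`.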

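The code above is the encoding and decoding process of YOLOv2. We explain it with the box-decoding formulas from the paper, where the t terms are the network outputs, the c terms are the top-left corner of the grid cell, and the p terms are the width and height of the preset anchor.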
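
For reference, the box decoding in the YOLOv2 paper can be written as follows, with \sigma denoting the sigmoid function and t_o the raw confidence output. In `yolo_head` the same quantities are additionally divided by the grid size (`conv_dims`), which normalizes them to the image.

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h} \\
\Pr(\text{object}) \cdot \mathrm{IOU}(b, \text{object}) &= \sigma(t_o)
\end{aligned}
```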

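So we can understand it this way: for width and height, the network output is the log of the ratio between the box size and the anchor size (the training target is log(ground-truth size / anchor size)); for the center point, the sigmoid of the output is the offset relative to the grid cell. That is, (tx, ty) themselves are not the relative offsets; σ(tx) and σ(ty) are.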
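
A small round trip makes this relationship explicit. The sketch below encodes a box into training targets the way `preprocess_true_boxes` does and then decodes raw outputs the way `yolo_head` does, staying in grid cell units. The function names `encode` and `decode` and the example box are assumptions for illustration.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def encode(box, anchor):
    """box = (x, y, w, h) in grid cell units; returns the cell (row, col) and the targets."""
    x, y, w, h = box
    col, row = int(np.floor(x)), int(np.floor(y))
    target = np.array([x - col, y - row,
                       np.log(w / anchor[0]), np.log(h / anchor[1])])
    return (row, col), target


def decode(t, cell, anchor):
    """t = raw outputs (tx, ty, tw, th); returns the box in grid cell units."""
    row, col = cell
    return np.array([sigmoid(t[0]) + col, sigmoid(t[1]) + row,
                     anchor[0] * np.exp(t[2]), anchor[1] * np.exp(t[3])])


anchor = (3.42, 4.41)
box = (6.3, 4.7, 3.0, 5.0)
cell, target = encode(box, anchor)

# A perfect network would output tx, ty whose sigmoid equals the target offsets,
# so the raw values are the logit (inverse sigmoid) of those offsets.
offsets = target[:2]
raw = np.concatenate([np.log(offsets / (1.0 - offsets)), target[2:]])
decoded = decode(raw, cell, anchor)
print(cell, decoded)
```

The decoded box matches the original box up to floating-point error, which shows that the offset targets live in sigmoid space while the size targets live in log space.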