
Article link: https://blog.youkuaiyun.com/weixin_42841721/article/details/125373776

YOLO Series, Part 1: YOLO-V1 Principles and Code Summary

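No introduction to object detection is complete without the YOLO series. Object detection as a whole is a large topic. By method, detectors divide into one-stage and two-stage algorithms: the YOLO series is the classic one-stage family, while the two-stage line started with RCNN and continued with fast-RCNN, faster-RCNN, Mask-RCNN, and so on. Detectors can also be divided into anchor-based ones (SSD, for example) and anchor-free ones. These are all classic algorithms that need to be mastered step by step. I am currently studying YOLO, so here is a record of how I learned it.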

1. What is YOLO?

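YOLO: you only look once! It is a truly end-to-end object detection algorithm. Before YOLO was proposed, the common approaches were rather cumbersome. For just how cumbersome, see the RCNN paper, which I personally consider the pioneering work on applying CNNs to object detection; it also reviews the detection methods that came before it. Roughly summarized: obtain many candidate boxes (proposals) from the image ==> extract image features by hand ==> classify them with a classifier. Every stage is laborious.
The rough RCNN pipeline: run selective search on the original image to obtain region proposals (about 2000 per image), feed each proposal into a CNN to obtain a feature vector that represents it (4096-dimensional in the paper), and classify these feature vectors with SVMs to get the predicted class of each proposal. Then filter the proposals with NMS, and finally apply bounding-box regression to the proposals that survive. That is roughly how it works, and it looks very complicated.
So what does YOLO do? RCNN already uses a CNN to extract image features, so why not use that CNN to predict directly, in a single pass? If one output vector contains both the class information and the location information, can't that vector represent the prediction for the object? That is exactly what YOLO does.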

2. How YOLO-v1 works

Basic idea:

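How YOLO-v1 works: the whole image is divided into a grid. The input image is 448x448 and is split into 7x7 cells, and each cell is used to predict objects. Whichever cell contains the center of an object is responsible for predicting that object. Each cell has B predicted boxes and uses them to predict the object (B is 2 in the paper).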
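To make the assignment rule concrete, here is a minimal sketch in plain Python. The helper name `responsible_cell` and its pixel-coordinate interface are illustrative assumptions, not part of the original code; only the 448x448 input and the 7x7 grid come from the text above.

```python
def responsible_cell(xmin, ymin, xmax, ymax, img_size=448, grid=7):
    """Return (row, col) of the grid cell that contains the box center."""
    cx = (xmin + xmax) / 2.0
    cy = (ymin + ymax) / 2.0
    cell = img_size / grid  # 64 pixels per cell for 448 / 7
    # Clamp so that a center lying exactly on the right/bottom edge stays in the grid
    col = min(int(cx // cell), grid - 1)
    row = min(int(cy // cell), grid - 1)
    return row, col


print(responsible_cell(100, 200, 300, 400))  # center (200, 300) -> (4, 3)
```

The row comes from the y coordinate and the column from the x coordinate, which is the order a (7, 7, 30) tensor would typically be indexed in.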

Network output:

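YOLO predicts per cell, so the model output is simply what each cell has to predict. Its dimension is B * (X, Y, W, H, C) + num_classes, which is 30 in the paper: 30 = 2 * 5 + 20 (the number of classes).
Here B is the number of bboxes; X, Y, W, H are the center coordinates, width, and height of a predicted box; and C is the confidence, i.e. how likely it is that the box contains an object (in the paper, the object probability multiplied by the IOU with the ground truth). num_classes is the number of class scores, laid out in the same positions as the one-hot class encoding of the target. So the shape of the model's output tensor is (batch_size, 7, 7, 30).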
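The layout of one cell's 30 values can be sketched as follows. The function name `split_cell` is hypothetical; the ordering (two 5-value boxes first, then 20 class scores) is the one the loss code in this post relies on when it reads the confidences at indices 4 and 9 and the classes from index 10 onward.

```python
B, NUM_CLASSES = 2, 20
CELL_DIM = B * 5 + NUM_CLASSES  # 2 * 5 + 20 = 30


def split_cell(vec):
    """Split one cell's 30 values into B boxes and the class scores."""
    assert len(vec) == CELL_DIM
    boxes = [tuple(vec[b * 5:(b + 1) * 5]) for b in range(B)]  # each (x, y, w, h, c)
    class_scores = list(vec[B * 5:])
    return boxes, class_scores


cell = [float(i) for i in range(CELL_DIM)]  # dummy values 0..29 so the indices are visible
boxes, class_scores = split_cell(cell)
print(boxes[0])           # (0.0, 1.0, 2.0, 3.0, 4.0): confidence sits at index 4
print(boxes[1])           # (5.0, 6.0, 7.0, 8.0, 9.0): confidence sits at index 9
print(len(class_scores))  # 20
```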

Training inputs:

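The input x is the original image, and y is the label; in detection we call it the target instead. The target is a tensor with the same shape as pred, (batch_size, 7, 7, 30). How are the 30 values obtained? X, Y, W, H can be computed from the object annotations of the image (each object comes with xmin, ymin, xmax, ymax): after normalization, this gives the center coordinates and the W, H of the ground-truth box in each cell. C is 1 in cells that contain an object and 0 otherwise. For the num_classes part, the position matching the object's class index is set to 1.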
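The encoding described above might look like the sketch below, written with nested lists so it runs without any dependency. The function name `encode_target` is hypothetical. It also assumes that x, y are stored as the center's offset inside its cell and that w, h are relative to the whole image; the post does not state this, it is inferred from how the loss code decodes boxes (dividing x, y by 7 and adding half of w, h).

```python
def encode_target(objects, img_size=448, grid=7, num_boxes=2, num_classes=20):
    """Build a grid x grid x 30 target (nested lists) from pixel-space boxes.

    objects: list of (xmin, ymin, xmax, ymax, class_id).
    """
    dim = num_boxes * 5 + num_classes
    target = [[[0.0] * dim for _ in range(grid)] for _ in range(grid)]
    cell = img_size / grid
    for xmin, ymin, xmax, ymax, class_id in objects:
        cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        col = min(int(cx // cell), grid - 1)
        row = min(int(cy // cell), grid - 1)
        x = cx / cell - col           # assumed: center offset inside the cell, in [0, 1]
        y = cy / cell - row
        w = (xmax - xmin) / img_size  # assumed: size relative to the whole image
        h = (ymax - ymin) / img_size
        for b in range(num_boxes):    # the same ground truth goes into every box slot
            target[row][col][b * 5:b * 5 + 5] = [x, y, w, h, 1.0]
        target[row][col][num_boxes * 5 + class_id] = 1.0
    return target


t = encode_target([(100, 200, 300, 400, 11)])
cell_vec = t[4][3]                              # the cell holding the center (200, 300)
print(cell_vec[4], cell_vec[9], cell_vec[21])   # 1.0 1.0 1.0
```

Writing the ground truth into both box slots is what makes indices 4 and 9 equal to 1 in every object cell, which the mask construction in the loss depends on.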

Loss computation:

loss = nooobj_loss + contain_loss + not_contain_loss + loc_loss + class_loss
[Figure: the YOLO-v1 loss formula, with its terms underlined in blue, red, black, and green]

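Terms underlined in blue are the coord_loss (called loc_loss in the code): the coordinate loss of the object regions, which regresses how accurate the box position is;
Terms underlined in red are the object loss: the confidence loss of boxes that detect an object, which regresses how confident the model is that an object is present;
Terms underlined in black are the no-object loss: the confidence loss of regions without an object, which regresses the confidence there toward 0;
Terms underlined in green are the class loss of the object regions: the loss on the probability that the region belongs to each class.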
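For reference, the loss as given in the YOLO-v1 paper can be written out as below, with S = 7, B = 2, lambda_coord = 5 and lambda_noobj = 0.5. The comments mapping each line to a color follow the four descriptions above and are my reading of them, not something recoverable from the missing image.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
    && \text{% coordinates (blue)} \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]
    && \text{% width and height (blue)} \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2
    && \text{% object confidence (red)} \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2
    && \text{% no-object confidence (black)} \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
    && \text{% class probabilities (green)}
\end{aligned}
```

Here the indicator for (i, j) with superscript obj is 1 when box j of cell i is the one responsible for an object, and the indicator with superscript noobj is 1 otherwise.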

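The rule for bbox prediction is: whichever cell the center of a GroundTruth box falls in, the B bboxes of that cell are responsible for predicting that GroundTruth. Among all cells of the target, those that contain a GroundTruth have the value 1 at the confidence positions, that is, at indices 4 and 9 (zero-based) of the 30-dimensional vector. This gives us masks for the cells with an object and the cells without one, which we then use to pick out the matching cells of pred and target for the loss. For cells without an object, only the confidence loss needs to be computed.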

The code is as follows:

import torch
import torch.nn as nn

# pred and target both have shape (batch_size, 7, 7, 30)
coo_mask = target[:, :, :, 4] > 0   # mask of cells that contain an object
noo_mask = target[:, :, :, 4] == 0  # mask of cells that contain no object

# Expand the masks over the last dimension so they match target: (batch_size, 7, 7, 30)
coo_mask = coo_mask.unsqueeze(-1).expand_as(target)
noo_mask = noo_mask.unsqueeze(-1).expand_as(target)

noo_pred = pred[noo_mask].view(-1, 30)      # no-object cells, e.g. (49 - 3, 30) when 3 cells hold objects
noo_target = target[noo_mask].view(-1, 30)

# In no-object cells only the two confidence entries (indices 4 and 9) enter the loss
noo_pred_mask = torch.zeros_like(noo_pred, dtype=torch.bool)
noo_pred_mask[:, 4] = True
noo_pred_mask[:, 9] = True

noo_pred_c = noo_pred[noo_pred_mask]      # predicted confidences only
noo_target_c = noo_target[noo_pred_mask]  # all zeros, e.g. 46 x 2 values

mse_loss = nn.MSELoss()
nooobj_loss = mse_loss(noo_pred_c, noo_target_c)  # confidence loss of the no-object cells

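For the cells that do contain an object, we take only the predicted box with the largest IOU against the ground-truth box and use it to compute the coordinate loss (coord_loss) and the confidence loss (contain_loss). For the other predicted boxes, only the confidence loss (not_contain_loss) is computed.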

The IOU is computed as follows:

def compute_iou(box1, box2):
    """Pairwise IOU between box1 (N, 4) and box2 (M, 4), both as (xmin, ymin, xmax, ymax)."""
    N = box1.size(0)
    M = box2.size(0)
    # top-left and bottom-right corners of every pairwise intersection: (N, M, 2)
    lt = torch.max(box1[:, :2].unsqueeze(1).expand(N, M, 2),
                   box2[:, :2].unsqueeze(0).expand(N, M, 2))
    rb = torch.min(box1[:, 2:].unsqueeze(1).expand(N, M, 2),
                   box2[:, 2:].unsqueeze(0).expand(N, M, 2))
    wh = (rb - lt).clamp(min=0)  # boxes that do not overlap get zero width/height
    intersection = wh[:, :, 0] * wh[:, :, 1]  # (N, M)
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # (N, )
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # (M, )
    area1 = area1.unsqueeze(1).expand_as(intersection)
    area2 = area2.unsqueeze(0).expand_as(intersection)

    iou = intersection / (area1 + area2 - intersection)
    return iou
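As a sanity check on the arithmetic, here is the same computation for a single pair of boxes in plain Python. The helper `iou_xyxy` is a hypothetical scalar counterpart written for illustration; it is not part of the original code.

```python
def iou_xyxy(a, b):
    """IOU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


print(iou_xyxy((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 4 + 4 - 1 = 7, so about 0.143
print(iou_xyxy((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes -> 0.0
```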

# Compute the loss for the cells that contain an object

coo_pred = pred[coo_mask].view(-1, 30)  # the 30 predicted values (2 * (x, y, w, h, c) + num_classes) of each object cell, e.g. 3 cells
print("coo_pred.shape", coo_pred.shape)
box_pred = coo_pred[:, :10].contiguous().view(-1, 5)  # (x, y, w, h, c) of every box; 3 cells with 2 boxes each gives 6 x 5

coo_target = target[coo_mask].view(-1, 30)
box_target = coo_target[:, :10].contiguous().view(-1, 5)  # box_target shape (num_objects * num_boxes, 5)
print("box_target shape: ", box_target.shape)

class_pred = coo_pred[:, 10:]
class_target = coo_target[:, 10:]

# Boolean masks marking the responsible / non-responsible predicted boxes
coo_response_mask = torch.zeros_like(box_target, dtype=torch.bool)
coo_not_response_mask = torch.zeros_like(box_target, dtype=torch.bool)

box_target_iou = torch.zeros_like(box_target)
for i in range(0, box_target.size()[0], 2):  # step 2: each cell contributes 2 boxes
    box1 = box_pred[i:i+2]
    box1_xyxy = torch.zeros_like(box1)
    # convert (center, size) to (xmin, ymin, xmax, ymax); x, y are rescaled by the 7 x 7 grid
    box1_xyxy[:, :2] = box1[:, :2] / 7. - 0.5 * box1[:, 2:4]
    box1_xyxy[:, 2:4] = box1[:, :2] / 7. + 0.5 * box1[:, 2:4]

    box2 = box_target[i].view(-1, 5)  # both target rows of a cell hold the same ground truth
    box2_xyxy = torch.zeros_like(box2)
    box2_xyxy[:, :2] = box2[:, :2] / 7. - 0.5 * box2[:, 2:4]
    box2_xyxy[:, 2:4] = box2[:, :2] / 7. + 0.5 * box2[:, 2:4]
    iou = compute_iou(box1_xyxy[:, :4], box2_xyxy[:, :4])  # [2, 1]
    max_iou, max_index = iou.max(0)
    coo_response_mask[i + max_index] = True  # the predicted bbox with the largest IOU is responsible for the coordinates
    coo_not_response_mask[i + 1 - max_index] = True  # the other bbox is not responsible

    # use the largest IOU as the target confidence for the MSE against the predicted confidence;
    # detach() keeps gradients from flowing through this target
    box_target_iou[i + max_index, 4] = max_iou.detach()

box_pred_response = box_pred[coo_response_mask].view(-1, 5)
box_target_response_iou = box_target_iou[coo_response_mask].view(-1, 5)
box_target_response = box_target[coo_response_mask].view(-1, 5)

# 1. response loss: confidence and location loss of the responsible boxes
contain_loss = mse_loss(box_pred_response[:, 4], box_target_response_iou[:, 4])  # confidence loss of boxes that contain an object
loc_loss = mse_loss(box_pred_response[:, :2], box_target_response[:, :2]) \
           + mse_loss(torch.sqrt(box_pred_response[:, 2:4]), torch.sqrt(box_target_response[:, 2:4]))

# 2. not response loss: the other box in the cell is not responsible for the object, but still contributes a loss

box_pred_not_response = box_pred[coo_not_response_mask].view(-1, 5)
box_target_not_response = box_target[coo_not_response_mask].view(-1, 5)
box_target_not_response[:, 4] = 0  # for these boxes the target confidence is 0

not_contain_loss = mse_loss(box_pred_not_response[:, 4], box_target_not_response[:, 4])

# 3. class loss
class_loss = mse_loss(class_pred, class_target)

Total loss:

final_loss = not_contain_loss + contain_loss + loc_loss + nooobj_loss + class_loss
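The sum above adds the five terms with equal weight. The YOLO-v1 paper instead weights the coordinate terms by lambda_coord = 5 and the no-object confidence terms by lambda_noobj = 0.5. A sketch of that weighting is shown below; the function name `weighted_yolo_loss` is hypothetical, and applying lambda_noobj to not_contain_loss as well is my reading of the paper's no-object indicator, not something the post's code does. Note also that the paper sums squared errors, while nn.MSELoss() averages by default, so the relative scale of the terms differs from the paper either way.

```python
LAMBDA_COORD = 5.0  # weight on the coordinate terms, from the YOLO-v1 paper
LAMBDA_NOOBJ = 0.5  # weight on the no-object confidence terms, from the paper


def weighted_yolo_loss(loc_loss, contain_loss, not_contain_loss, nooobj_loss, class_loss):
    """Combine the five loss terms with the paper's weights (works on floats or scalar tensors)."""
    return (LAMBDA_COORD * loc_loss
            + contain_loss
            + LAMBDA_NOOBJ * (not_contain_loss + nooobj_loss)
            + class_loss)


print(weighted_yolo_loss(1.0, 1.0, 1.0, 1.0, 1.0))  # 5 + 1 + 0.5 * 2 + 1 = 8.0
```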