Hand-Rolling YOLOv1: Reimplementing the YOLOv1 Network in PyTorch

Building the YOLOv1 network from scratch, including training and testing.

The code in this post is adapted from https://github.com/abeardear/pytorch-YOLO-v1; runnable code is available at Autumnorliu/Yolov1 (YOLOv1 from scratch).

1. YOLOv1 Principles

YOLOv1 was proposed by Joseph Redmon et al. in 2016 and is the pioneering one-stage object detection algorithm. Before YOLOv1, deep-learning-based detectors were mostly two-stage; YOLOv1 broke with that framework and achieved real-time object detection.

YOLOv1 recasts detection as a regression problem: the network looks at the image only once and maps pixels directly to bounding-box coordinates and class probabilities, which makes it very fast.

[Figure: the YOLOv1 network architecture]

As the figure shows, the YOLOv1 architecture consists of 24 convolutional layers, 4 max-pooling layers, and 2 fully connected layers. The convolution and pooling layers extract features, while the fully connected layers make the predictions. The final layer outputs 7x7x30, where 7x7 corresponds to the 7x7 grid cells the input image is divided into.

[Figure: the image divided into a 7x7 grid, each cell predicting boxes and class probabilities]

YOLOv1 divides the image into a 7x7 grid; if an object's center falls inside a grid cell, that cell is responsible for detecting the object. Each grid cell predicts B bounding boxes (x, y, w, h), their confidences, and C class probabilities. The predictions are encoded as an SxSx(Bx5+C) tensor. This post trains on the VOC2012 dataset, so the final output is 7x7x(2x5+20).
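As a quick check on the tensor size (a trivial sketch, assuming the paper's S=7, B=2 and the 20 VOC classes):

S, B, C = 7, 2, 20       # grid size, boxes per cell, number of classes
depth = B * 5 + C        # each box contributes x, y, w, h, conf
print((S, S, depth))     # (7, 7, 30): the 7x7x30 prediction tensor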

2. Model Definition

[Figure: layer-by-layer architecture of the implemented network]

The model input is 448x448x3 and the first convolution uses a large 7x7 kernel. The network here has 24 layers in total: 22 convolutional layers (two fewer than the paper's 24) followed by 2 fully connected layers, with a final output of 7x7x30.

The original paper does not use batch normalization, but to speed up convergence and mitigate vanishing/exploding gradients, this implementation adds BN layers.

Basic building block:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CBL(nn.Module):
    """
    Conv-BN-LeakyReLU
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super(CBL, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding=padding, bias=False)  # BN follows, so no bias is needed
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(inplace=True)  # in-place op saves memory and is slightly faster

    def forward(self, x):
        x = self.act(self.bn(self.conv(x)))
        return x
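For example, the first layer of the network below halves the spatial size (a small sketch with a dummy input):

block = CBL(3, 64, kernel_size=7, stride=2, padding=3)
x = torch.randn(1, 3, 448, 448)
print(block(x).shape)  # torch.Size([1, 64, 224, 224])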

Backbone, used to extract features:

class Backbone(nn.Module):
    def __init__(self):
        super(Backbone, self).__init__()
        self.backbone = nn.Sequential(
            # input (448, 448, 3)
            CBL(3, 64, 7, 2, 3),  # (224, 224, 64)
            nn.MaxPool2d(2, 2),  # (112, 112, 64)

            CBL(64, 192, 3, 1, 1),   # (112, 112, 192)
            nn.MaxPool2d(2, 2),  # (56, 56, 192)

            CBL(192, 128, 1, 1, 0), # (56, 56, 128)
            CBL(128, 256, 3, 1, 1), # (56, 56, 256)
            CBL(256, 256, 1, 1, 0), # (56, 56, 256)
            CBL(256, 512, 3, 1, 1), # (56, 56, 512)
            nn.MaxPool2d(2, 2), # (28, 28, 512)

            CBL(512, 256, 1, 1, 0), # (28, 28, 256)
            CBL(256, 512, 3, 1, 1), # (28, 28, 512), this pair repeats 4 times
            CBL(512, 256, 1, 1, 0),
            CBL(256, 512, 3, 1, 1),
            CBL(512, 256, 1, 1, 0),
            CBL(256, 512, 3, 1, 1),
            CBL(512, 256, 1, 1, 0),
            CBL(256, 512, 3, 1, 1), # (28, 28, 512)
            CBL(512, 512, 1, 1, 0), # (28, 28, 512)
            CBL(512, 1024, 3, 1, 1), # (28, 28, 1024)
            nn.MaxPool2d(2, 2), # (14, 14, 1024)

            CBL(1024, 512, 1, 1, 0),  # this pair repeats twice
            CBL(512, 1024, 3, 1, 1),
            CBL(1024, 512, 1, 1, 0),
            CBL(512, 1024, 3, 1, 1), # (14, 14, 1024)
            CBL(1024, 1024, 3, 1, 1), # (14, 14, 1024)
            CBL(1024, 1024, 3, 2, 1), # (7, 7, 1024)
        )

    def forward(self, x):
        x = self.backbone(x)
        return x
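A quick shape check for the backbone (a small sketch using a dummy input):

backbone = Backbone()
x = torch.randn(1, 3, 448, 448)
print(backbone(x).shape)  # torch.Size([1, 1024, 7, 7])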

Head: to reduce the model size, the hidden dimension is cut from 4096 to 2048.

class Head(nn.Module):
    """
    The detection head is two fully connected layers: 7x7x1024 -> 2048
    (reduced from the paper's 4096), then 2048 -> 7x7x30.
    """
    def __init__(self, num_classes=20):
        super(Head, self).__init__()
        self.classifier = nn.Sequential(
            nn.Linear(7*7*1024, 2048, bias=True),  # bias is needed here since no BN follows
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(2048, (num_classes+10)*7*7, bias=True),  # hidden dim changed 4096 -> 2048
        )

    def forward(self, x):
        return self.classifier(x)

The full network:

class Yolo(nn.Module):
    """
    The Yolo network consists of a backbone and a head; the backbone outputs
    7x7x1024 and the head outputs 7x7x30.
    """
    def __init__(self, num_classes=20):
        super(Yolo, self).__init__()
        self.backbone = Backbone()
        self.head = Head(num_classes)

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()

    def forward(self, x):
        x = self.backbone(x)
        x = x.permute(0, 2, 3, 1)  # (bs, 1024, 7, 7) -> (bs, 7, 7, 1024)
        x = torch.flatten(x, start_dim=1, end_dim=3)  # flatten to (bs, 7*7*1024)
        x = self.head(x)
        x = torch.sigmoid(x)  # squash outputs to 0-1 (F.sigmoid is deprecated)
        x = x.view(-1, 7, 7, 30)  # reshape to (bs, 7, 7, 30)
        return x
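A sanity check of the whole model (a small sketch; the sigmoid guarantees outputs in [0, 1]):

model = Yolo(num_classes=20)
x = torch.randn(2, 3, 448, 448)
out = model(x)
print(out.shape)  # torch.Size([2, 7, 7, 30])
print(0 <= out.min().item() and out.max().item() <= 1)  # True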

3. Data Processing and Encoding

The goal of this part is to convert an image's annotation (from a txt file) into a 7x7x30 tensor. As described above, each cell holds two boxes (x, y, w, h, conf). During encoding, for the cell containing a ground-truth box center, both box confidences (indices 4 and 9 of that cell's 30-dim vector) are set to 1; the two boxes share the same x, y, w, h in normalized coordinates, and the corresponding class probability is set to 1 while all other classes stay 0. Cells that contain no object center are all zeros.

[Figure: an example image of size 500x332 with one annotated object]

Take this image as an example. Its width and height are 500 and 332, and its annotation is 25 34 419 271 16 (x1 y1 x2 y2 class):

After normalization, the box coordinates are (0.05, 0.1024, 0.8380, 0.8163).

The box center is (0.4440, 0.4593).

Each grid cell has size 1/7 ≈ 0.143.

The center's position on the 7x7 grid is (0.4440/0.143, 0.4593/0.143); taking the floor gives (3, 3), i.e., the 4th row and 4th column. The confidence slots and the corresponding class probability of that cell are set to 1.

Next, compute the center's offset relative to the top-left corner of that cell. Expanding the 30-dim vectors of one grid column:

[Table: the 30-dim target vectors of the 7 cells in one grid column]

Since this image has only one ground-truth box, every row of the table above is zero except the fourth.

With the offsets and w, h in hand, fill indices 0-1 and 5-6 of that cell's vector with the offsets, and indices 2-3 and 7-8 with w and h (w and h are relative to the whole image), matching the encoder below.

Below is the encoding code; the boxes passed in are normalized xyxy coordinates.

def encoder(self, boxes, labels):
    '''
    boxes (tensor) [[x1,y1,x2,y2],[]] ex: tensor([[0.0500, 0.1024, 0.8380, 0.8163]])
    labels (tensor) [...]
    return 7x7x30
    '''
    grid_num = 7
    target = torch.zeros((grid_num, grid_num, 30))
    cell_size = 1. / grid_num
    wh = boxes[:, 2:] - boxes[:, :2]  # width/height of each ground-truth box
    cxcy = (boxes[:, 2:] + boxes[:, :2]) / 2  # box centers
    for i in range(cxcy.size()[0]):
        cxcy_sample = cxcy[i]
        ij = (cxcy_sample / cell_size).ceil() - 1  # grid indices (col, row) of the center
        target[int(ij[1]), int(ij[0]), 4] = 1
        target[int(ij[1]), int(ij[0]), 9] = 1
        target[int(ij[1]), int(ij[0]), int(labels[i]) + 9] = 1
        xy = ij * cell_size  # top-left corner of the matched cell (relative coords)
        delta_xy = (cxcy_sample - xy) / cell_size  # offset from the cell's top-left corner
        target[int(ij[1]), int(ij[0]), 2:4] = wh[i]
        target[int(ij[1]), int(ij[0]), :2] = delta_xy
        target[int(ij[1]), int(ij[0]), 7:9] = wh[i]
        target[int(ij[1]), int(ij[0]), 5:7] = delta_xy
    return target
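To verify the worked example above, the encoder can be called directly (a small sketch; the self argument is unused, so None is passed):

boxes = torch.tensor([[0.0500, 0.1024, 0.8380, 0.8163]])
labels = torch.LongTensor([17])  # class 16 from the txt file, +1 as applied in YoloDataset below
target = encoder(None, boxes, labels)
print(target.shape)      # torch.Size([7, 7, 30])
print(target[3, 3, :4])  # offsets (~0.108, ~0.216) and w, h (~0.788, ~0.714)
print(target[3, 3, 4].item(), target[3, 3, 9].item())  # 1.0 1.0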

The full data-loading code:

dataset.py

'''
Annotation txt format: image_name.jpg x1 y1 x2 y2 c x1 y1 x2 y2 c
(a line like the above describes an image containing two objects; the
coordinates are corner-style, matching how they are parsed below)
'''
import os
import os.path

import numpy as np

import torch
import torch.utils.data as data
import torchvision.transforms as transforms

import cv2


class YoloDataset(data.Dataset):
    image_size = 448
    def __init__(self, root, list_file, train, transform=None):
        print('data init')
        self.root = root
        self.train = train
        self.transform = transform
        self.fnames = []
        self.boxes = []
        self.labels = []
        self.mean = (123, 117, 104)  # RGB

        with open(list_file) as f:
            lines = f.readlines()

        for line in lines:
            splited = line.strip().split()
            self.fnames.append(splited[0])
            num_boxes = (len(splited) - 1) // 5
            box = []
            label = []
            for i in range(num_boxes):
                x = float(splited[1 + 5 * i])
                y = float(splited[2 + 5 * i])
                x2 = float(splited[3 + 5 * i])
                y2 = float(splited[4 + 5 * i])
                c = splited[5 + 5 * i]
                box.append([x, y, x2, y2])
                label.append(int(c) + 1)  # shift class ids by 1; the encoder then writes class k at slot k + 9
            self.boxes.append(torch.Tensor(box))
            self.labels.append(torch.LongTensor(label))
        self.num_samples = len(self.boxes)

    def __getitem__(self, idx):
        fname = self.fnames[idx]
        img_path = os.path.join(self.root, 'images', fname)
        img = cv2.imread(img_path)
        boxes = self.boxes[idx].clone()
        labels = self.labels[idx].clone()

        h, w, _ = img.shape
        boxes /= torch.Tensor([w, h, w, h]).expand_as(boxes)  # normalize coordinates to [0, 1]
        img = self.BGR2RGB(img)  # because pytorch pretrained model use RGB
        img = self.subMean(img, self.mean)  # subtract the per-channel means
        img = cv2.resize(img, (self.image_size, self.image_size))
        target = self.encoder(boxes, labels)  # 7x7x30
        for t in self.transform:
            img = t(img)

        return img, target

    # required: len() is called when iterating over the dataset
    def __len__(self):
        return self.num_samples

    def encoder(self, boxes, labels):
        '''
        boxes (tensor) [[x1,y1,x2,y2],[]] ex:tensor([[0.0500, 0.1024, 0.8380, 0.8163]])
        labels (tensor) [...]
        return 7x7x30
        '''
        grid_num = 7
        target = torch.zeros((grid_num, grid_num, 30))
        cell_size = 1. / grid_num
        wh = boxes[:, 2:] - boxes[:, :2]  # width/height of each ground-truth box
        cxcy = (boxes[:, 2:] + boxes[:, :2]) / 2  # box centers
        for i in range(cxcy.size()[0]):
            cxcy_sample = cxcy[i]
            ij = (cxcy_sample / cell_size).ceil() - 1  # grid indices (col, row) of the center
            target[int(ij[1]), int(ij[0]), 4] = 1
            target[int(ij[1]), int(ij[0]), 9] = 1
            target[int(ij[1]), int(ij[0]), int(labels[i]) + 9] = 1
            xy = ij * cell_size  # top-left corner of the matched cell (relative coords)
            delta_xy = (cxcy_sample - xy) / cell_size  # offset from the cell's top-left corner
            target[int(ij[1]), int(ij[0]), 2:4] = wh[i]
            target[int(ij[1]), int(ij[0]), :2] = delta_xy
            target[int(ij[1]), int(ij[0]), 7:9] = wh[i]
            target[int(ij[1]), int(ij[0]), 5:7] = delta_xy
        return target

    def BGR2RGB(self, img):
        return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    def subMean(self, bgr, mean):
        mean = np.array(mean, dtype=np.float32)
        bgr = bgr - mean
        return bgr


if __name__ == '__main__':
    from torch.utils.data import DataLoader

    root = r'D:\codes\python\Yolov1\datasets'  # raw string so the backslashes are not treated as escapes
    list_file = os.path.join(root, 'train.txt')
    train_dataset = YoloDataset(root, list_file, train=True, transform=[transforms.ToTensor()])
    train_loader = DataLoader(train_dataset, batch_size=1, shuffle=False, num_workers=0)
    train_iter = iter(train_loader)
    for i in range(10):
        img, target = next(train_iter)
        print(img.shape)

The dataset is organized as follows:

/datasets
    ├── images
    │   ├── 1.jpg
    │   └── 2.jpg
    ├── outputs
    │   ├── 1.xml
    │   └── 2.xml
    ├── train.txt
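If the annotation tool exports PASCAL VOC-style XML (an assumption; adjust the tag names to your tool's actual schema), train.txt can be generated with a small script like the sketch below. The xml_to_line helper and the datasets path are illustrative, following the tree above.

import os
import xml.etree.ElementTree as ET

VOC_CLASSES = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
               'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
               'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')

def xml_to_line(xml_path):
    """Turn one VOC-style XML file into a 'name.jpg x1 y1 x2 y2 c ...' line."""
    root = ET.parse(xml_path).getroot()
    fields = [root.find('filename').text]  # assumes a <filename> tag exists
    for obj in root.iter('object'):
        box = obj.find('bndbox')
        coords = [box.find(k).text for k in ('xmin', 'ymin', 'xmax', 'ymax')]
        cls = VOC_CLASSES.index(obj.find('name').text)
        fields += coords + [str(cls)]
    return ' '.join(fields)

if __name__ == '__main__':
    root = 'datasets'
    with open(os.path.join(root, 'train.txt'), 'w') as f:
        for name in sorted(os.listdir(os.path.join(root, 'outputs'))):
            if name.endswith('.xml'):
                f.write(xml_to_line(os.path.join(root, 'outputs', name)) + '\n')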

4. Loss Definition and Computation

The loss function as defined in the original paper:
[Figure: the multi-part loss function from the YOLOv1 paper]

The YOLOv1 loss consists of three parts:

In the YOLO detection model, the loss function drives training so that the model predicts bounding boxes and class probabilities more accurately. The loss in the paper is made up of several parts, as follows:

(1) Coordinate prediction loss
This measures the difference between the predicted and ground-truth box coordinates as a sum of squared errors; the width and height terms use square roots, which partially corrects for large and small boxes being weighted equally (from the paper: "Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly."). The formula is:

$$\lambda_{coord} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\right]+\lambda_{coord} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2}+(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}\right]$$

where:

  • $\lambda_{coord}$ is the weight of the coordinate loss (set to 5).
  • $S$ is the grid size; the image is divided into $S^{2}$ cells.
  • $B$ is the number of bounding boxes predicted per cell.
  • $\mathbb{1}_{ij}^{obj}$ indicates whether the $j$-th box of cell $i$ is responsible for predicting an object.
  • $(x_{i}, y_{i}, w_{i}, h_{i})$ are the predicted box center and size.
  • $(\hat{x}_{i}, \hat{y}_{i}, \hat{w}_{i}, \hat{h}_{i})$ are the ground-truth box center and size.

(2) Confidence prediction loss
This measures how far the predicted objectness confidence is from the ground truth. It is computed separately for boxes that contain an object and boxes that do not, with different weights (1 when an object is present, $\lambda_{noobj}$ = 0.5 otherwise) to balance the two terms:

$$\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_{i}-\hat{C}_{i})^{2}+\lambda_{noobj} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_{i}-\hat{C}_{i})^{2}$$

where:

  • $C_{i}$ is the predicted confidence.
  • $\hat{C}_{i}$ is the ground-truth confidence.

(3) Class prediction loss
This measures the difference between the predicted and ground-truth class probabilities, and is only computed for grid cells that contain an object:

$$\sum_{i=0}^{S^{2}} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}}(p_{i}(c)-\hat{p}_{i}(c))^{2}$$

where:

  • $p_{i}(c)$ is the predicted probability of class $c$.
  • $\hat{p}_{i}(c)$ is the ground-truth probability of class $c$.

The loss function jointly accounts for box coordinates, confidence, and class predictions; their weighted sum is what the optimizer minimizes during training, and driving it down improves detection performance.
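Written out in full, the training objective is:

$$\mathcal{L} = \lambda_{coord} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}+(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2}+(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}\right] + \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_{i}-\hat{C}_{i})^{2} + \lambda_{noobj} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_{i}-\hat{C}_{i})^{2} + \sum_{i=0}^{S^{2}} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}}(p_{i}(c)-\hat{p}_{i}(c))^{2}$$

In the implementation below, $\lambda_{coord}$ and $\lambda_{noobj}$ correspond to the l_coord = 5 and l_noobj = 0.5 arguments passed to yoloLoss.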

The loss computation flow: start → initialize and build the object/no-object masks → compute the no-object loss → compute the object-containing loss → select the best-IoU predicted box → compute the localization loss → compute the confidence loss → compute the classification loss → sum the parts → return the total loss.
# encoding:utf-8

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class yoloLoss(nn.Module):
    def __init__(self, S, B, l_coord, l_noobj):
        super(yoloLoss, self).__init__()
        self.S = S
        self.B = B
        self.l_coord = l_coord
        self.l_noobj = l_noobj

    def compute_iou(self, box1, box2):
        '''Compute the intersection over union of two set of boxes, each box is [x1,y1,x2,y2].
        Args:
          box1: (tensor) bounding boxes, sized [N,4].
          box2: (tensor) bounding boxes, sized [M,4].
        Return:
          (tensor) iou, sized [N,M].
        '''
        N = box1.size(0)
        M = box2.size(0)

        lt = torch.max(
            box1[:, :2].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:, :2].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )

        rb = torch.min(
            box1[:, 2:].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:, 2:].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )

        wh = rb - lt  # [N,M,2]
        wh[wh < 0] = 0  # clip at 0
        inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

        area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # [N,]
        area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # [M,]
        area1 = area1.unsqueeze(1).expand_as(inter)  # [N,] -> [N,1] -> [N,M]
        area2 = area2.unsqueeze(0).expand_as(inter)  # [M,] -> [1,M] -> [N,M]

        iou = inter / (area1 + area2 - inter)
        return iou

    def forward(self, pred_tensor, target_tensor):
        '''
        pred_tensor: (tensor) size(batchsize,S,S,Bx5+20=30) [x,y,w,h,c]
        target_tensor: (tensor) size(batchsize,S,S,30)
        '''
        N = pred_tensor.size()[0]  # batch size
        coo_mask = target_tensor[:, :, :, 4] > 0  # (BS,7,7)
        noo_mask = target_tensor[:, :, :, 4] == 0  # (BS,7,7)
        coo_mask = coo_mask.unsqueeze(-1).expand_as(target_tensor)  # (BS,7,7,30)
        noo_mask = noo_mask.unsqueeze(-1).expand_as(target_tensor)  # (BS,7,7,30)

        # extract predictions and targets for cells that contain an object
        coo_pred = pred_tensor[coo_mask].view(-1, 30)  # first dim varies with the batch contents
        box_pred = coo_pred[:, :10].contiguous().view(-1, 5)  # two boxes per cell: [x, y, w, h, c]
        class_pred = coo_pred[:, 10:]  # the 20 class probabilities

        coo_target = target_tensor[coo_mask].view(-1, 30)  # (n, 30)
        box_target = coo_target[:, :10].contiguous().view(-1, 5)
        class_target = coo_target[:, 10:]

        # compute not contain obj loss
        noo_pred = pred_tensor[noo_mask].view(-1, 30)
        noo_target = target_tensor[noo_mask].view(-1, 30)
        # boolean mask (ByteTensor indexing is deprecated in recent PyTorch)
        noo_pred_mask = torch.zeros(noo_pred.size(), dtype=torch.bool, device=noo_pred.device)
        noo_pred_mask[:, 4] = True
        noo_pred_mask[:, 9] = True
        noo_pred_c = noo_pred[noo_pred_mask]  # only the two confidences contribute here, size [-1, 2]
        noo_target_c = noo_target[noo_pred_mask]
        nooobj_loss = F.mse_loss(noo_pred_c, noo_target_c, reduction='sum')

        # compute contain obj loss
        coo_response_mask = torch.zeros(box_target.size(), dtype=torch.bool, device=box_target.device)
        coo_not_response_mask = torch.zeros(box_target.size(), dtype=torch.bool, device=box_target.device)
        box_target_iou = torch.zeros(box_target.size()).cuda()
        for i in range(0, box_target.size()[0], 2):  # each cell predicts 2 boxes; keep the best-IoU one
            box1 = box_pred[i:i + 2]
            box1_xyxy = Variable(torch.FloatTensor(box1.size()))
            # note: the reference repo divides xy by 14 and ignores the cell's absolute
            # position, so the IoU is computed in an approximate cell-local frame;
            # this quirk is kept unchanged here
            box1_xyxy[:, :2] = box1[:, :2] / 14. - 0.5 * box1[:, 2:4]
            box1_xyxy[:, 2:4] = box1[:, :2] / 14. + 0.5 * box1[:, 2:4]
            box2 = box_target[i].view(-1, 5)
            box2_xyxy = Variable(torch.FloatTensor(box2.size()))
            box2_xyxy[:, :2] = box2[:, :2] / 14. - 0.5 * box2[:, 2:4]
            box2_xyxy[:, 2:4] = box2[:, :2] / 14. + 0.5 * box2[:, 2:4]
            iou = self.compute_iou(box1_xyxy[:, :4], box2_xyxy[:, :4])  # [2,1]
            max_iou, max_index = iou.max(0)
            max_index = max_index.data.cuda()

            coo_response_mask[i + max_index] = True
            coo_not_response_mask[i + 1 - max_index] = True

            #####
            # we want the confidence score to equal the
            # intersection over union (IOU) between the predicted box
            # and the ground truth
            #####
            box_target_iou[i + max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()
        box_target_iou = Variable(box_target_iou).cuda()
        # 1.response loss
        box_pred_response = box_pred[coo_response_mask].view(-1, 5)
        box_target_response_iou = box_target_iou[coo_response_mask].view(-1, 5)
        box_target_response = box_target[coo_response_mask].view(-1, 5)
        contain_loss = F.mse_loss(box_pred_response[:, 4], box_target_response_iou[:, 4], reduction='sum')
        loc_loss = F.mse_loss(box_pred_response[:, :2], box_target_response[:, :2], reduction='sum') + F.mse_loss(
            torch.sqrt(box_pred_response[:, 2:4]), torch.sqrt(box_target_response[:, 2:4]), reduction='sum')
        # 2.not response loss
        box_target_not_response = box_target[coo_not_response_mask].view(-1, 5)
        box_target_not_response[:, 4] = 0
        # not_contain_loss = F.mse_loss(box_pred_response[:,4],box_target_response[:,4],reduction='sum')

        # 3.class loss
        class_loss = F.mse_loss(class_pred, class_target, reduction='sum')

        # note: contain_loss is weighted by 2 here, a deviation from the paper (which weights it 1)
        return (self.l_coord * loc_loss + 2 * contain_loss + self.l_noobj * nooobj_loss + class_loss) / N
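A minimal smoke test for the loss (a sketch; CUDA is required because the implementation above allocates some buffers on the GPU, and the hand-filled target values follow the encoder from section 3):

criterion = yoloLoss(S=7, B=2, l_coord=5, l_noobj=0.5)
pred = torch.rand(2, 7, 7, 30).cuda()  # fake network output in [0, 1)
target = torch.zeros(2, 7, 7, 30)
target[:, 3, 3, 4] = 1                 # pretend one object per image, in cell (3, 3)
target[:, 3, 3, 9] = 1
target[:, 3, 3, :4] = torch.tensor([0.108, 0.216, 0.788, 0.714])
target[:, 3, 3, 5:9] = torch.tensor([0.108, 0.216, 0.788, 0.714])
target[:, 3, 3, 26] = 1                # class 17 lands in slot 17 + 9 = 26
loss = criterion(pred, target.cuda())
print(loss.item())                     # a single scalar, averaged over the batch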

5. Training

[Figure: the learning-rate schedule from the paper]

The paper's learning-rate strategy is shown above. To get results quickly, this post does not adopt it and instead keeps lr = 0.001 throughout. In addition, the model is saved every 10 epochs.
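For reference, the paper's schedule (warm up from 1e-3 to 1e-2, then 1e-2 for 75 epochs, 1e-3 for 30, and 1e-4 for 30) could be approximated with a LambdaLR as in this hedged sketch. The paper_lr name and the 5-epoch warm-up length are assumptions (the paper only says the rate is raised "slowly" at first); optimizer refers to the SGD optimizer created in the script below.

from torch.optim.lr_scheduler import LambdaLR

def paper_lr(epoch):
    # multipliers on the base lr of 1e-3
    if epoch < 5:
        return 1.0 + 9.0 * epoch / 5   # linear warm-up: 1e-3 -> 1e-2
    if epoch < 80:
        return 10.0                    # 1e-2 for ~75 epochs
    if epoch < 110:
        return 1.0                     # 1e-3 for 30 epochs
    return 0.1                         # 1e-4 for the final 30 epochs

scheduler = LambdaLR(optimizer, lr_lambda=paper_lr)  # call scheduler.step() once per epoch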

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
import torch.nn as nn
import numpy as np
import torchvision.transforms as transforms

from torch.utils.data import DataLoader
from torchvision import models
from torch.autograd import Variable

from yolo_loss import yoloLoss
from dataset import YoloDataset
from yolo import Yolo


use_gpu = torch.cuda.is_available()
file_root = r'E:\datasets\VOCtrainval_11-May-2012\VOCdevkit\VOC2012\JPEGImages'
learning_rate = 0.001
num_epochs = 100
batch_size = 64


if __name__ == '__main__':
    net = Yolo()

    print('cuda', torch.cuda.current_device(), torch.cuda.device_count())

    criterion = yoloLoss(7, 2, 5, 0.5)
    if use_gpu:
        net.cuda()

    net.train()
    params = net.parameters()
    optimizer = torch.optim.SGD(params, lr=learning_rate, momentum=0.9, weight_decay=5e-4)
    # optimizer = torch.optim.Adam(net.parameters(),lr=learning_rate,weight_decay=1e-4)

    train_dataset = YoloDataset(root=file_root, list_file='voc2012.txt', train=True, transform=[transforms.ToTensor()])
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=8)
    # test_dataset = YoloDataset(root=file_root,list_file='voc2007test.txt',train=False,transform = [transforms.ToTensor()] )
    # test_loader = DataLoader(test_dataset,batch_size=batch_size,shuffle=False,num_workers=4)
    print('the dataset has %d images' % (len(train_dataset)))
    print('the batch_size is %d' % (batch_size))
    logfile = open('log.txt', 'w')

    num_iter = 0
    best_test_loss = np.inf
    for epoch in range(num_epochs):
        net.train()
        print('\n\nStarting epoch %d / %d' % (epoch + 1, num_epochs))
        print('Learning Rate for this epoch: {}'.format(learning_rate))

        total_loss = 0.

        for i, (images, target) in enumerate(train_loader):
            images = Variable(images)
            target = Variable(target)
            if use_gpu:
                images, target = images.cuda(), target.cuda()

            pred = net(images)
            loss = criterion(pred, target)
            total_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 10 == 0:
                print('Epoch [%d/%d], Iter [%d/%d] Loss: %.4f, average_loss: %.4f'
                    % (epoch + 1, num_epochs, i + 1, len(train_loader), loss.item(), total_loss / (i + 1)))
                num_iter += 1
        
        if (epoch + 1) % 10 == 0:
            # pick a checkpoint path, using the epoch number in the file name
            model_save_path = 'model_epoch_{}.pth'.format(epoch + 1)
            # save the model's state dict
            torch.save(net.state_dict(), model_save_path)
            print(f'Model saved to {model_save_path}')

6. Prediction

# encoding:utf-8

import torch
import torch.nn as nn

from yolo import Yolo
import torchvision.transforms as transforms
import cv2
import numpy as np

VOC_CLASSES = (  # always index 0
    'aeroplane', 'bicycle', 'bird', 'boat',
    'bottle', 'bus', 'car', 'cat', 'chair',
    'cow', 'diningtable', 'dog', 'horse',
    'motorbike', 'person', 'pottedplant',
    'sheep', 'sofa', 'train', 'tvmonitor')

Color = [[0, 0, 0],
         [128, 0, 0],
         [0, 128, 0],
         [128, 128, 0],
         [0, 0, 128],
         [128, 0, 128],
         [0, 128, 128],
         [128, 128, 128],
         [64, 0, 0],
         [192, 0, 0],
         [64, 128, 0],
         [192, 128, 0],
         [64, 0, 128],
         [192, 0, 128],
         [64, 128, 128],
         [192, 128, 128],
         [0, 64, 0],
         [128, 64, 0],
         [0, 192, 0],
         [128, 192, 0],
         [0, 64, 128]]


def decoder(pred):
    '''
    pred (tensor) 1x7x7x30
    return (tensor) box[[x1,y1,x2,y2]] cls_index[cls] probs[x]
    '''
    grid_num = 7
    boxes = []
    cls_indexs = []
    probs = []
    cell_size = 1. / grid_num
    pred = pred.data
    pred = pred.squeeze(0)  # 7x7x30
    contain1 = pred[:, :, 4].unsqueeze(2)  # 7x7 -> 7x7x1
    contain2 = pred[:, :, 9].unsqueeze(2)
    contain = torch.cat((contain1, contain2), 2)  # 7x7x2
    mask1 = contain > 0.1  # confidence above threshold
    mask2 = (contain == contain.max())  # always keep the single highest-confidence box, even if below threshold
    mask = mask1 | mask2  # 7x7x2 boolean mask: keep a box if either condition holds
    # min_score, min_index = torch.min(contain, 2)  # (alternative: keep only the better box per cell)
    for i in range(grid_num):
        for j in range(grid_num):
            for b in range(2):
                # index = min_index[i,j]
                # mask[i,j,index] = 0
                if mask[i, j, b] == 1:
                    # print(i,j,b)
                    box = pred[i, j, b * 5:b * 5 + 4]  # ex: tensor([0.33, 0.75, 0.30, 0.43]) in xywh form
                    contain_prob = torch.FloatTensor([pred[i, j, b * 5 + 4]])  # ex: tensor([0.9])
                    xy = torch.FloatTensor([j, i]) * cell_size  # top-left corner of the cell
                    box[:2] = box[:2] * cell_size + xy  # center (cx, cy) relative to the whole image
                    box_xy = torch.FloatTensor(box.size())  # convert [cx,cy,w,h] to [x1,y1,x2,y2]
                    box_xy[:2] = box[:2] - 0.5 * box[2:]
                    box_xy[2:] = box[:2] + 0.5 * box[2:]
                    max_prob, cls_index = torch.max(pred[i, j, 10:], 0)
                    if float((contain_prob * max_prob)[0]) > 0.1:
                        boxes.append(box_xy.view(1, 4))
                        cls_indexs.append(cls_index.unsqueeze(0))
                        probs.append(contain_prob * max_prob)

    if len(boxes) == 0:
        boxes = torch.zeros((1, 4))
        probs = torch.zeros(1)
        cls_indexs = torch.zeros(1)
    else:
        boxes = torch.cat(boxes, 0)  # (n,4)
        probs = torch.cat(probs, 0)  # (n,)
        cls_indexs = torch.cat(cls_indexs, 0)  # (n,)
    keep = nms(boxes, probs)
    return boxes[keep], cls_indexs[keep], probs[keep]


def nms(bboxes, scores, threshold=0.5):
    '''
    bboxes(tensor) [N,4]
    scores(tensor) [N,]
    '''
    x1 = bboxes[:, 0]
    y1 = bboxes[:, 1]
    x2 = bboxes[:, 2]
    y2 = bboxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)

    _, order = scores.sort(0, descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())  # store plain ints so torch.LongTensor(keep) works
        if order.numel() == 1:  # the last remaining box is kept before stopping
            break

        xx1 = x1[order[1:]].clamp(min=x1[i])
        yy1 = y1[order[1:]].clamp(min=y1[i])
        xx2 = x2[order[1:]].clamp(max=x2[i])
        yy2 = y2[order[1:]].clamp(max=y2[i])

        w = (xx2 - xx1).clamp(min=0)
        h = (yy2 - yy1).clamp(min=0)
        inter = w * h

        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        ids = (ovr <= threshold).nonzero().squeeze(1)  # squeeze(1) keeps ids 1-D even with a single element
        if ids.numel() == 0:
            break
        order = order[ids + 1]
    return torch.LongTensor(keep)


#
# start predict one image
#
def predict_gpu(model, image_name, root_path=''):
    result = []
    image = cv2.imread(root_path + image_name)
    h, w, _ = image.shape
    img = cv2.resize(image, (448, 448))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    mean = (123, 117, 104)  # RGB
    img = img - np.array(mean, dtype=np.float32)

    transform = transforms.Compose([transforms.ToTensor(), ])
    img = transform(img)
    img = img[None, :, :, :].cuda()

    with torch.no_grad():  # Variable(..., volatile=True) was removed from PyTorch; no_grad is the modern equivalent
        pred = model(img)  # 1x7x7x30
    pred = pred.cpu()
    boxes, cls_indexs, probs = decoder(pred)

    for i, box in enumerate(boxes):
        x1 = int(box[0] * w)
        x2 = int(box[2] * w)
        y1 = int(box[1] * h)
        y2 = int(box[3] * h)
        cls_index = cls_indexs[i]
        cls_index = int(cls_index)  # convert LongTensor to int
        prob = probs[i]
        prob = float(prob)
        result.append([(x1, y1), (x2, y2), VOC_CLASSES[cls_index], image_name, prob])
    return result


if __name__ == '__main__':
    model = Yolo()
    print('load model...')
    model.load_state_dict(torch.load(r'E:\lyb\Yolov1\model_epoch_50.pth'))
    model.eval()
    model.cuda()
    image_name = r'E:\lyb\Yolov1\dog.jpg'
    image = cv2.imread(image_name)
    print('predicting...')
    result = predict_gpu(model, image_name)
    print('result: ', result)
    for left_up, right_bottom, class_name, _, prob in result:
        color = Color[VOC_CLASSES.index(class_name)]
        cv2.rectangle(image, left_up, right_bottom, color, 2)
        label = class_name + str(round(prob, 2))
        text_size, baseline = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.4, 1)
        p1 = (left_up[0], left_up[1] - text_size[1])
        cv2.rectangle(image, (p1[0] - 2 // 2, p1[1] - 2 - baseline), (p1[0] + text_size[0], p1[1] + text_size[1]),
                      color, -1)
        cv2.putText(image, label, (p1[0], p1[1] + baseline), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255, 255, 255), 1, 8)

    cv2.imwrite('result.jpg', image)
