While studying YOLOv8 I noticed that its detection head differs from YOLOv5's (here I'm discussing the official Ultralytics version). Most posts online don't cover the details very well, so I'm writing this one to walk through it.
First, be clear that YOLOv8 regresses bounding boxes with Distribution Focal Loss (DFL). It sounds fancy, but the underlying idea is not complicated. Stepping through the YOLOv8 code, we can see:
class Detect(nn.Module):
    """YOLOv8 Detect head for detection models."""

    dynamic = False  # force grid reconstruction
    export = False  # export mode
    shape = None
    anchors = torch.empty(0)  # init
    strides = torch.empty(0)  # init

    def __init__(self, nc=80, ch=()):
        """Initializes the YOLOv8 detection layer with specified number of classes and channels."""
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor
        self.stride = torch.zeros(self.nl)  # strides computed during build
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch
        )
        self.cv3 = nn.ModuleList(nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch)
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

    def forward(self, x):
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
        if self.training:  # Training path
            return x
        # rest of the code omitted
The box-regression branch self.cv2 ends in a 1x1 conv with 16 * 4 = 64 output channels, while self.cv3 handles classification (I'm assuming you already have a deep-learning and YOLO background). In forward the two branches are concatenated, so each feature map that passes through the head comes out with shape (batch_size, 4 * reg_max + nc, h_i, w_i). YOLOv8 predicts at 3 scales by default, so the forward pass (in training mode) returns a list of length 3 whose i-th element has that shape, for i = 0, 1, 2.
In other words, for a feature map of size h_i * w_i, every pixel produces 64 box-regression values (plus nc class scores).
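Those 64 channels are best read as 4 box sides × reg_max = 16 bins: for each side (the left/top/right/bottom distance from an anchor point), the network predicts a discrete distribution over 16 bins, and DFL collapses it into a single expected distance. Below is a minimal sketch of that expectation decode; it mirrors the idea behind ultralytics' DFL module (a fixed-weight 1x1 conv over the softmaxed bins), but treat the exact details as illustrative rather than the official implementation:

import torch
import torch.nn as nn


class DFLExpectation(nn.Module):
    """Collapse per-side bin distributions into expected distances (illustrative sketch)."""

    def __init__(self, reg_max=16):
        super().__init__()
        # Fixed 1x1 conv whose weights are the bin indices 0..reg_max-1,
        # so conv(softmax(bins)) equals the expectation over the bins.
        self.reg_max = reg_max
        self.proj = nn.Conv2d(reg_max, 1, 1, bias=False).requires_grad_(False)
        self.proj.weight.data[:] = torch.arange(reg_max, dtype=torch.float).view(1, reg_max, 1, 1)

    def forward(self, x):
        # x: (batch, 4 * reg_max, num_anchors) -- raw logits from the box branch
        b, _, a = x.shape
        x = x.view(b, 4, self.reg_max, a).transpose(1, 2)  # (b, reg_max, 4, a)
        return self.proj(x.softmax(1)).view(b, 4, a)       # (b, 4, a) expected l/t/r/b distances


# Quick shape check with dummy data
dist = DFLExpectation(16)(torch.randn(2, 64, 8400))
print(dist.shape)  # torch.Size([2, 4, 8400])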
Next come a few data-preparation steps (I may cover them in a later update if there's interest); let's go straight to the key part:
class v8DetectionLoss:
    """Criterion class for computing training losses."""

    # .......
    # .......

    def __call__(self, preds, batch):
        """Calculate the sum of the loss for box, cls and dfl multiplied by batch size."""
        loss = torch.zeros(3, device=self.device)  # box, cls, dfl
        feats = preds[1] if isinstance(preds, tuple) else preds
        pred_distri, pred_scores = torch.cat([xi.view(feats[0].shape[0], self.no, -1) for xi in feats], 2).split(
            (self.reg_max * 4, self.nc), 1
        )

        pred_scores = pred_scores.permute(0, 2, 1).contiguous()
        pred_distri = pred_distri.permute(0, 2, 1).contiguous()

        dtype = pred_scores.dtype
        batch_size = pred_scores.shape[0]
        imgsz = torch.tensor(feats[0].shape[2:], device=self.device, dtype=dtype) * self.stride[0]  # image size (h,w)
        anchor_points, stride_tensor = make_anchors(feats, self.stride, 0.5)

        # Targets
        targets = torch.cat((batch["batch_idx"].view(-1, 1), batch["cls"].view(-1, 1), batch["bboxes"]), 1)
        targets = self.preprocess(targets.to(self.device), batch_size, scale_tensor=imgsz[[1, 0, 1, 0]])
        gt_labels, gt_bboxes = targets.split((1, 4), 2)  # cls, xyxy
        mask_gt = gt_bboxes.sum(2, keepdim=True).gt_(0)

        # Pboxes
        pred_bboxes = self.bbox_decode(anchor_points, pred_distri)  # xyxy, (b, h*w, 4)

        _, target_bboxes, target_scores, fg_mask, _ = self.assigner(
            pred_scores.detach().sigmoid(),
            (pred_bboxes.detach() * stride_tensor).type(gt_bboxes.dtype),
            anchor_points * stride_tensor,
            gt_labels,
            gt_bboxes,
            mask_gt,
        )

        target_scores_sum = max(target_scores.sum(), 1)

        # Cls loss
        # loss[1] = self.varifocal_loss(pred_scores, target_scores, target_labels) / target_scores_sum  # VFL way
        loss[1] = self.bce(pred_scores, target_scores.to(dtype)).sum() / target_scores_sum  # BCE

        # Bbox loss
        if fg_mask.sum():
            target_bboxes /= stride_tensor
            loss[0], loss[2] = self.bbox_loss(
                pred_distri, pred_bboxes, anchor_points, target_bboxes, target_scores, target_scores_sum, fg_mask, target_bboxes * stride_tensor
            )

        loss[0] *= self.hyp.box  # box gain
        loss[1] *= self.hyp.cls  # cls gain
        loss[2] *= self.hyp.dfl  # dfl gain

        return loss.sum() * batch_size, loss.detach()  # loss(box, cls, dfl)
Here feats is the list of the 3 per-scale feature maps produced by the head. From my debugging session, each element has shape (batch_size, reg_max * 4 (= 64) + num_classes (11 in my case), h_i, w_i). They then go through this line:
pred_distri, pred_scores = torch.cat([xi.view(feats[0].shape[0], self.no, -1) for xi in feats], 2).split(
    (self.reg_max * 4, self.nc), 1
)
This reshapes each element of feats to (batch_size, self.no (= 64 + 11), -1), where -1 is inferred as h_i * w_i. Since feats has length 3, concatenating along the last dimension gives shape (batch_size, self.no (= 64 + 11), 80*80 + 40*40 + 20*20 = 8400) for a 640x640 input. Splitting along dim 1 then yields pred_distri with shape (batch_size, 64, 8400) and pred_scores with shape (batch_size, num_classes, 8400). The dimensions are then permuted:
pred_scores = pred_scores.permute(0, 2, 1).contiguous()
pred_distri = pred_distri.permute(0, 2, 1).contiguous()
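If you want to convince yourself of these shapes without running the full trainer, here is a tiny self-contained sanity check with dummy tensors (the 640x640 input, 11 classes and batch size of 2 are just assumptions for illustration):

import torch

nc, reg_max = 11, 16
no = nc + reg_max * 4  # 75 outputs per anchor
batch_size = 2

# Dummy per-scale head outputs for a 640x640 input (strides 8 / 16 / 32)
feats = [torch.randn(batch_size, no, s, s) for s in (80, 40, 20)]

pred_distri, pred_scores = torch.cat(
    [xi.view(feats[0].shape[0], no, -1) for xi in feats], 2
).split((reg_max * 4, nc), 1)

pred_scores = pred_scores.permute(0, 2, 1).contiguous()
pred_distri = pred_distri.permute(0, 2, 1).contiguous()

print(pred_distri.shape)  # torch.Size([2, 8400, 64])
print(pred_scores.shape)  # torch.Size([2, 8400, 11])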
After the permute, pred_distri has shape (batch_size, 8400, 64) and pred_scores has shape (batch_size, 8400, num_classes): each image in the batch yields 8400 predictions, with a 64-dimensional box encoding and an 11-dimensional class score vector per prediction. At this point the picture is fairly clear; what remains is to convert the 64-dimensional box encoding into the familiar 4-value box representation (xywh or xyxy):
anchor_points, stride_tensor = make_anchors(feats, self.stride, 0.5)
def make_anchors(feats, strides, grid_cell_offset=0.5):
    """Generate anchors from features."""
    anchor_points, stride_tensor = [], []
    assert feats is not None
    dtype, device = feats[0].dtype, feats[0].device
    for i, stride in enumerate(strides):
        _, _, h, w = feats[i].shape
        sx = torch.arange(end=w, device=device, dtype=dtype) + grid_cell_offset  # shift x
        sy = torch.arange(end=h, device=device, dtype=dtype) + grid_cell_offset  # shift y
        # meshgrid with explicit indexing (older torch versions need a compatibility guard here)
        sy, sx = torch.meshgrid(sy, sx, indexing="ij")
        anchor_points.append(torch.stack((sx, sy), -1).view(-1, 2))
        stride_tensor.append(torch.full((h * w, 1), stride, dtype=dtype, device=device))
    return torch.cat(anchor_points), torch.cat(stride_tensor)
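To see what make_anchors gives us, here is a quick check on the same dummy feature maps as above (strides 8/16/32 for a 640x640 input are again an assumption for illustration):

import torch

feats = [torch.randn(2, 75, s, s) for s in (80, 40, 20)]
strides = [8.0, 16.0, 32.0]

anchor_points, stride_tensor = make_anchors(feats, strides, 0.5)
print(anchor_points.shape)  # torch.Size([8400, 2]) -- grid-cell centers, in units of each scale's own stride
print(stride_tensor.shape)  # torch.Size([8400, 1]) -- stride of the scale each prediction came from

So anchor_points holds the center of every grid cell across the three scales, and stride_tensor records which stride each of the 8400 predictions belongs to; bbox_decode then combines these anchor points with the DFL-decoded l/t/r/b distances to produce the xyxy boxes seen earlier in the loss code.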
