While studying YOLOv8 I noticed that its detection head differs from YOLOv5's (here I'm discussing the official Ultralytics version). Most posts online don't cover the details very well, so I'm writing this one to walk through it.
First, be clear that YOLOv8 regresses bounding boxes with Distribution Focal Loss (DFL). It sounds fancy, but the underlying idea is not complicated. Stepping through the YOLOv8 code, we can see:
class Detect(nn.Module):
    """YOLOv8 Detect head for detection models."""

    dynamic = False  # force grid reconstruction
    export = False  # export mode
    shape = None
    anchors = torch.empty(0)  # init
    strides = torch.empty(0)  # init

    def __init__(self, nc=80, ch=()):
        """Initializes the YOLOv8 detection layer with specified number of classes and channels."""
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor
        self.stride = torch.zeros(self.nl)  # strides computed during build
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch
        )
        self.cv3 = nn.ModuleList(nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch)
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

    def forward(self, x):
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
        if self.training:  # Training path
            return x
        # rest of the code omitted
The box-regression branch self.cv2 ends in a 1x1 conv with 16 * 4 = 64 output channels, while self.cv3 handles classification (I'm assuming you already have a deep-learning and YOLO background). In forward the two branches are concatenated, so each feature map that passes through the head comes out with shape (batch_size, 4 * reg_max + nc, h_i, w_i). YOLOv8 predicts at 3 scales by default, so the forward pass (in training mode) returns a list of length 3 whose i-th element has that shape, for i = 0, 1, 2.
In other words, for a feature map of size h_i * w_i, every pixel produces 64 box-regression values (plus nc class scores).
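Those 64 channels are best read as 4 box sides × reg_max = 16 bins: for each side (the left/top/right/bottom distance from an anchor point), the network predicts a discrete distribution over 16 bins, and DFL collapses it into a single expected distance. Below is a minimal sketch of that expectation decode; it mirrors the idea behind ultralytics' DFL module (a fixed-weight 1x1 conv over the softmaxed bins), but treat the exact details as illustrative rather than the official implementation:

import torch
import torch.nn as nn


class DFLExpectation(nn.Module):
    """Collapse per-side bin distributions into expected distances (illustrative sketch)."""

    def __init__(self, reg_max=16):
        super().__init__()
        # Fixed 1x1 conv whose weights are the bin indices 0..reg_max-1,
        # so conv(softmax(bins)) equals the expectation over the bins.
        self.reg_max = reg_max
        self.proj = nn.Conv2d(reg_max, 1, 1, bias=False).requires_grad_(False)
        self.proj.weight.data[:] = torch.arange(reg_max, dtype=torch.float).view(1, reg_max, 1, 1)

    def forward(self, x):
        # x: (batch, 4 * reg_max, num_anchors) -- raw logits from the box branch
        b, _, a = x.shape
        x = x.view(b, 4, self.reg_max, a).transpose(1, 2)  # (b, reg_max, 4, a)
        return self.proj(x.softmax(1)).view(b, 4, a)       # (b, 4, a) expected l/t/r/b distances


# Quick shape check with dummy data
dist = DFLExpectation(16)(torch.randn(2, 64, 8400))
print(dist.shape)  # torch.Size([2, 4, 8400])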
Next come a few data-preparation steps (I may cover them in a later update if there's interest); let's go straight to the key part:
class v8DetectionLoss:
    """Criterion class for computing training losses."""

    # .......
    # .......

    def __call__(self, preds, batch):
        """Calculate the sum of the loss for box, cls and dfl multiplied by batch size."""
        loss = torch.zeros(3, device=self.device)  # box, cls, dfl
        feats = preds[1] if isinstance(preds, tuple) else preds
        pred_distri, pred_scores = torch.cat([xi.view(feats[0].shape[0], self.no, -1) for xi in feats], 2).split(
            (self.reg_max * 4, self.nc), 1
        )

        pred_scores = pred_scores.permute(0, 2, 1).contiguous()
        pred_distri = pred_distri.permute(0, 2, 1).contiguous()

        dtype = pred_scores.dtype
        batch_size = pred_scores.shape[0]
        imgsz = torch.tensor(feats[0].shape[2:], device=self.device, dtype=dtype) * self.stride[0]  # image size (h,w)
        anchor_points, stride_tensor = make_anchors(feats, self.stride, 0.5)

        # Targets
        targets = torch.cat((batch["batch_idx"].view(-1, 1), batch["cls"].view(-1, 1), batch["bboxes"]), 1)
        targets = self.preprocess(targets.to(self.device), batch_size, scale_tensor=imgsz[[1, 0, 1, 0]])
        gt_labels, gt_bboxes = targets.split((1, 4), 2)  # cls, xyxy
        mask_gt = gt_bboxes.sum(2, keepdim=True).gt_(0)

        # Pboxes
        pred_bboxes = self.bbox_decode(anchor_points, pred_distri)  # xyxy, (b, h*w, 4)

        _, target_bboxes, target_scores, fg_mask, _ = self.assigner(
            pred_scores.detach().sigmoid(),
            (pred_bboxes.detach() * stride_tensor).type(gt_bboxes.dtype),
            anchor_points * stride_tensor,
            gt_labels,
            gt_bboxes,
            mask_gt,
        )

        target_scores_sum = max(target_scores.sum(), 1)

        # Cls loss
        # loss[1] = self.varifocal_loss(pred_scores, target_scores, target_labels) / target_scores_sum  # VFL way
        loss[1] = self.bce(pred_scores, target_scores.to(dtype)).sum() / target_scores_sum  # BCE

        # Bbox loss
        if fg_mask.sum():
            target_bboxes /= stride_tensor
            loss[0], loss[2] = self.bbox_loss(
                pred_distri, pred_bboxes, anchor_points, target_bboxes, target_scores, target_scores_sum, fg_mask, target_bboxes * stride_tensor
            )

        loss[0] *= self.hyp.box  # box gain
        loss[1] *= self.hyp.cls  # cls gain
        loss[2] *= self.hyp.dfl  # dfl gain

        return loss.sum() * batch_size, loss.detach()  # loss(box, cls, dfl)
Here feats is the list of the 3 per-scale feature maps produced by the head. From my debugging session, each element has shape (batch_size, reg_max * 4 (= 64) + num_classes (11 in my case), h_i, w_i). They then go through this line:
pred_distri, pred_scores = torch.cat([xi.view(feats[0].shape[0], self.no, -1) for xi in feats], 2).split(
    (self.reg_max * 4, self.nc), 1
)
This reshapes each element of feats to (batch_size, self.no (= 64 + 11), -1), where -1 is inferred as h_i * w_i. Since feats has length 3, concatenating along the last dimension gives shape (batch_size, self.no (= 64 + 11), 80*80 + 40*40 + 20*20 = 8400) for a 640x640 input. Splitting along dim 1 then yields pred_distri with shape (batch_size, 64, 8400) and pred_scores with shape (batch_size, num_classes, 8400). The dimensions are then permuted:
pred_scores = pred_scores.permute(0, 2, 1).contiguous()
pred_distri = pred_distri.permute(0, 2, 1).contiguous()
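If you want to convince yourself of these shapes without running the full trainer, here is a tiny self-contained sanity check with dummy tensors (the 640x640 input, 11 classes and batch size of 2 are just assumptions for illustration):

import torch

nc, reg_max = 11, 16
no = nc + reg_max * 4  # 75 outputs per anchor
batch_size = 2

# Dummy per-scale head outputs for a 640x640 input (strides 8 / 16 / 32)
feats = [torch.randn(batch_size, no, s, s) for s in (80, 40, 20)]

pred_distri, pred_scores = torch.cat(
    [xi.view(feats[0].shape[0], no, -1) for xi in feats], 2
).split((reg_max * 4, nc), 1)

pred_scores = pred_scores.permute(0, 2, 1).contiguous()
pred_distri = pred_distri.permute(0, 2, 1).contiguous()

print(pred_distri.shape)  # torch.Size([2, 8400, 64])
print(pred_scores.shape)  # torch.Size([2, 8400, 11])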
After the permute, pred_distri has shape (batch_size, 8400, 64) and pred_scores has shape (batch_size, 8400, num_classes): each image in the batch yields 8400 predictions, with a 64-dimensional box encoding and an 11-dimensional class score vector per prediction. At this point the picture is fairly clear; what remains is to convert the 64-dimensional box encoding into the familiar 4-value box representation (xywh or xyxy):
anchor_points, stride_tensor = make_anchors(feats, self.stride, 0.5)
def make_anchors(feats, strides, grid_cell_offset=0.5):
    """Generate anchors from features."""
    anchor_points, stride_tensor = [], []
    assert feats is not None
    dtype, device = feats[0].dtype, feats[0].device
    for i, stride in enumerate(strides):
        _, _, h, w = feats[i].shape
        sx = torch.arange(end=w, device=device, dtype=dtype) + grid_cell_offset  # shift x
        sy = torch.arange(end=h, device=device, dtype=dtype) + grid_cell_offset  # shift y
        # meshgrid with explicit indexing (older torch versions need a compatibility guard here)
        sy, sx = torch.meshgrid(sy, sx, indexing="ij")
        anchor_points.append(torch.stack((sx, sy), -1).view(-1, 2))
        stride_tensor.append(torch.full((h * w, 1), stride, dtype=dtype, device=device))
    return torch.cat(anchor_points), torch.cat(stride_tensor)
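To see what make_anchors gives us, here is a quick check on the same dummy feature maps as above (strides 8/16/32 for a 640x640 input are again an assumption for illustration):

import torch

feats = [torch.randn(2, 75, s, s) for s in (80, 40, 20)]
strides = [8.0, 16.0, 32.0]

anchor_points, stride_tensor = make_anchors(feats, strides, 0.5)
print(anchor_points.shape)  # torch.Size([8400, 2]) -- grid-cell centers, in units of each scale's own stride
print(stride_tensor.shape)  # torch.Size([8400, 1]) -- stride of the scale each prediction came from

So anchor_points holds the center of every grid cell across the three scales, and stride_tensor records which stride each of the 8400 predictions belongs to; bbox_decode then combines these anchor points with the DFL-decoded l/t/r/b distances to produce the xyxy boxes seen earlier in the loss code.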
