The difference between the scale and multi-scale training parameters in YOLO

When training YOLOv5/v7, there are two parameters related to multiple scales: one is scale, the other is multi-scale (YOLOv8 removed the latter).
scale is set in the hyperparameter configuration file:
[screenshot: the scale entry in the hyperparameter YAML]
multi-scale is set as a command-line option of the training script:
[screenshot: the --multi-scale argument of train.py]
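
For reference, a rough sketch of what the two settings look like in the public YOLOv5/v7 repositories (file names and default values vary between repos and versions, so treat these as illustrative):

    # hyperparameter YAML (e.g. data/hyps/hyp.scratch-low.yaml in YOLOv5)
    scale: 0.5  # image scale (+/- gain)

    # training command: multi-scale is a boolean flag of train.py
    python train.py --img 640 --multi-scale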
So what is the difference between these two parameters?
First, let's look at where each of them is used in the code.
scale
scale is used in random_perspective in datasets.py; it randomly scales the image (it is the scale component of the random perspective/affine transform).

    # Rotation and Scale -- the `scale` hyperparameter sets the range of the random factor s
    R = np.eye(3)
    a = random.uniform(-degrees, degrees)
    # a += random.choice([-180, -90, 0, 90])  # add 90deg rotations to small rotations
    s = random.uniform(1 - scale, 1.1 + scale)  # e.g. scale=0.5 -> s in [0.5, 1.6]
    # s = 2 ** random.uniform(-scale, scale)
    R[:2] = cv2.getRotationMatrix2D(angle=a, center=(0, 0), scale=s)
    ...

    # Combined rotation matrix
    M = T @ S @ R @ P @ C  # order of operations (right to left) is IMPORTANT
    if (border[0] != 0) or (border[1] != 0) or (M != np.eye(3)).any():  # image changed
        if perspective:
            img = cv2.warpPerspective(img, M, dsize=(width, height), borderValue=(114, 114, 114))
        else:  # affine
            img = cv2.warpAffine(img, M[:2], dsize=(width, height), borderValue=(114, 114, 114))

After the image has been transformed, the box annotations must be transformed with the same matrix:

    # warp boxes with the combined matrix M
    xy = np.ones((n * 4, 3))
    xy[:, :2] = targets[:, [1, 2, 3, 4, 1, 4, 3, 2]].reshape(n * 4, 2)  # x1y1, x2y2, x1y2, x2y1
    xy = xy @ M.T  # transform
    xy = (xy[:, :2] / xy[:, 2:3] if perspective else xy[:, :2]).reshape(n, 8)  # perspective rescale or affine

    # create new boxes
    x = xy[:, [0, 2, 4, 6]]
    y = xy[:, [1, 3, 5, 7]]
    new = np.concatenate((x.min(1), y.min(1), x.max(1), y.max(1))).reshape(4, n).T

    # clip
    new[:, [0, 2]] = new[:, [0, 2]].clip(0, width)
    new[:, [1, 3]] = new[:, [1, 3]].clip(0, height)

scale takes effect while the dataset is being loaded: the resulting image still has the network's input size, but the size of the objects (their proportion of the image) changes.
[screenshot: example of a training image after the scale augmentation]
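
To make this concrete, here is a minimal, self-contained sketch (not the YOLO code itself) that applies only the scaling part of random_perspective to a dummy 640x640 image and a single box, assuming scale=0.5. It shows that the canvas size stays fixed while the box, and therefore the object's share of the image, changes:

    import random

    import cv2
    import numpy as np

    scale = 0.5
    s = random.uniform(1 - scale, 1.1 + scale)  # same range as the snippet above
    M = np.eye(3)
    M[:2] = cv2.getRotationMatrix2D(angle=0, center=(0, 0), scale=s)  # pure scaling, no rotation

    img = np.full((640, 640, 3), 114, np.uint8)  # dummy letterboxed image
    out = cv2.warpAffine(img, M[:2], dsize=(640, 640), borderValue=(114, 114, 114))
    print(out.shape)  # (640, 640, 3): the canvas size is unchanged

    box = np.array([[100, 100, 200, 200]], dtype=float)  # one x1y1x2y2 box
    xy = np.ones((4, 3))
    xy[:, :2] = box[:, [0, 1, 2, 3, 0, 3, 2, 1]].reshape(4, 2)  # the 4 corners
    xy = xy @ M.T  # transform the corners with the same matrix
    new = [xy[:, 0].min(), xy[:, 1].min(), xy[:, 0].max(), xy[:, 1].max()]
    print(new)  # box coordinates scaled by s: the object's proportion of the image changed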

multi-scale

multi-scale is applied in train.py, right after a batch has been read from the dataset:

    # Multi-scale
    if opt.multi_scale:
        sz = random.randrange(imgsz * 0.5, imgsz * 1.5 + gs) // gs * gs  # size
        sf = sz / max(imgs.shape[2:])  # scale factor
        if sf != 1:
            ns = [math.ceil(x * sf / gs) * gs for x in imgs.shape[2:]]  # new shape (stretched to gs-multiple)
            imgs = F.interpolate(imgs, size=ns, mode='bilinear', align_corners=False)

The resulting image size no longer necessarily equals the network's nominal input size, so this effectively trains the network with multiple input resolutions. The boxes are not rescaled here because the label coordinates are normalized; resizing the whole image does not change an object's relative position or proportion, so no box processing is needed.
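
A minimal runnable sketch of the same step outside train.py, assuming imgsz=640 and gs=32 (the maximum stride) and a dummy batch in place of the dataloader output (the int() casts are added here because newer Python versions reject float arguments to randrange):

    import math
    import random

    import torch
    import torch.nn.functional as F

    imgsz, gs = 640, 32  # assumed nominal input size and maximum stride
    imgs = torch.zeros(2, 3, 640, 640)  # dummy batch (B, C, H, W)

    sz = random.randrange(int(imgsz * 0.5), int(imgsz * 1.5 + gs)) // gs * gs  # random size, multiple of gs
    sf = sz / max(imgs.shape[2:])  # scale factor relative to the current batch
    if sf != 1:
        ns = [math.ceil(x * sf / gs) * gs for x in imgs.shape[2:]]  # new H, W (gs-multiples)
        imgs = F.interpolate(imgs, size=ns, mode='bilinear', align_corners=False)
    print(imgs.shape)  # e.g. torch.Size([2, 3, 416, 416]): the whole batch was resized

    # The labels stay untouched: they are normalized xywh in [0, 1], so resizing the
    # image does not change an object's relative position or proportion.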

Closing remarks

scale and multi-scale are two scale-related parameters in YOLO. YOLOv8 removed multi-scale, and the project maintainers on GitHub say that training with this option is not recommended.
