This post introduces VG-W3D, a weakly supervised 3D object detection model that uses point clouds and images. Weakly supervised 3D object detection aims to learn a 3D detector at low annotation cost, e.g., using only 2D labels. Unlike previous work that still relies on less accurate 3D annotated boxes, this paper proposes a new approach that exploits the constraints between the 2D and 3D domains without requiring any 3D annotations. Specifically, the paper uses visual data to build connections between the 2D and 3D domains from three perspectives:
- First, a feature-level constraint is designed to align LiDAR and image features in object-aware regions.
- Second, an output-level constraint is developed to enforce overlap between the 2D boxes and the projected 3D boxes.
- Finally, a training-level constraint is realized by generating consistent 3D pseudo labels that align with the visual data.
Extensive experiments on the KITTI dataset validate the effectiveness of the three proposed constraints. Without using any 3D annotations, the method achieves favorable performance gains compared with state-of-the-art approaches.
Project link: https://github.com/kuanchihhuang/VG-W3D?tab=readme-ov-file
Introduction
As shown in Figure 1, the authors study the visually guided learning process from three perspectives: object learning at the feature level, response learning at the output level, and pseudo-label learning at the training level.
- Feature-level guidance. For well-calibrated images and LiDAR, the object predictions obtained from the image should align with the corresponding regions in the LiDAR data. For example, when a point is recognized as a foreground object by the 3D detector, its projected pixel on the image plane should agree with the category prediction made by the 2D detector.
- Output-level guidance. A 2D box and the projection of its 3D bounding box onto the image plane overlap substantially. Based on this observation, a 2D-3D constraint is established to guide the supervision of the 3D box candidates.
- Training-level guidance. Due to the sparsity of point clouds, the initial 3D labels obtained from non-learning heuristics can be noisy and may miss objects, so it is essential to iteratively refine these labels to improve their accuracy. Another challenge is reducing false-positive pseudo labels, since during training the model can easily produce unintended estimates with high confidence scores. Therefore, the prediction scores of the 2D boxes from the visual domain are integrated into the pseudo-labeling procedure to ensure that object scores are consistent between the 2D and 3D domains.
Table 1 compares the proposed method with other weakly supervised 3D detection methods. The proposed approach requires no 3D annotations at all, only 2D labels, and combines multiple forms of visual guidance: feature-level, output-level, and training-level.
Proposed Approach
The figure below shows the proposed method. VG-W3D leverages feature-level, output-level, and training-level cues to guide 3D object detection. To obtain initial 3D boxes for training, a non-learning method similar to FGR is adopted to identify the frustum point cloud of each object, and a heuristic algorithm then estimates the initial 3D labels. In the initial stage, a 2D detector is trained with the provided 2D annotated boxes $\hat{B}_I$ to extract visual features $F_I$ and predict 2D bounding boxes $B_I$ with their corresponding confidence scores $\sigma_I$. Subsequently, a PointRCNN detector is used to extract point cloud features $F_P$ and generate 3D bounding boxes $B_P$ together with detection scores $\sigma_P$.
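Purely as a reading aid (not the authors' released code), one training step of the flow described above could be sketched as follows; every identifier here (`detector_2d`, `detector_3d`, the loss modules, and the batch keys) is a hypothetical placeholder:

```python
# A rough sketch of one VG-W3D training step under the description above.
# All names are placeholder assumptions, not the released API.
def train_step(batch, detector_2d, detector_3d, feat_loss, out_loss, optimizer):
    # 2D branch (pretrained on the provided 2D boxes): image features,
    # 2D boxes B_I and confidence scores sigma_I.
    feat_img, boxes_2d, scores_2d = detector_2d(batch["image"])

    # 3D branch (PointRCNN-style): point features, 3D boxes B_P and scores sigma_P.
    feat_pts, boxes_3d, scores_3d = detector_3d(batch["points"])

    # Initial 3D labels come from the non-learning FGR-like frustum heuristic and
    # are later refreshed by the training-level pseudo-labeling step.
    loss = detector_3d.detection_loss(boxes_3d, batch["initial_3d_labels"])
    loss = loss + feat_loss(feat_pts, feat_img, batch["proj_xy"], batch["seg_label"])
    loss = loss + out_loss(boxes_3d, batch["gt_boxes_2d"], batch["calib"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```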
Feature-Level Visual Guidance
The feature-level visual guidance is described below. As shown in Figure 3, let the image features be $F_I \in \mathbb{R}^{H\times W\times C}$ and the point cloud features be $F_P \in \mathbb{R}^{P\times C}$. The point cloud features are first projected onto the image plane using the camera calibration parameters, yielding the projected point features $F_{P'} = \mathrm{Proj}(F_P) \in \mathbb{R}^{H\times W\times C}$.
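For reference, a minimal sketch of the projection step with KITTI-style calibration (not the authors' implementation; `P2`, `R0_rect`, and `Tr_velo_to_cam` are assumed to come from the calibration file):

```python
import numpy as np

def project_lidar_to_image(pts_lidar, P2, R0_rect, Tr_velo_to_cam):
    """Project LiDAR points (N, 3) to pixel coordinates (N, 2) with KITTI calibration.

    P2: (3, 4) camera projection matrix, R0_rect: (3, 3) rectification matrix,
    Tr_velo_to_cam: (3, 4) LiDAR-to-camera transform (all read from the calib file).
    """
    n = pts_lidar.shape[0]
    pts_h = np.hstack([pts_lidar, np.ones((n, 1))])        # (N, 4) homogeneous LiDAR coords
    pts_cam = (R0_rect @ (Tr_velo_to_cam @ pts_h.T)).T     # (N, 3) rectified camera coords
    pts_h = np.hstack([pts_cam, np.ones((n, 1))])          # (N, 4)
    pts_img = (P2 @ pts_h.T).T                             # (N, 3)
    uv = pts_img[:, :2] / pts_img[:, 2:3]                  # perspective division
    return uv, pts_cam[:, 2]                               # pixel coordinates and depth
```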
A straightforward approach would be to force the point cloud features to mimic the image features with an L2 loss:
$$\mathcal{L}_{\text{feat}} = \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \| \mathbf{F}_{\mathcal{I}}(i) - \mathbf{F}_{\mathcal{P}'}(i) \|_2$$
where $\mathcal{A}$ is the set of valid pixel locations in the image. However, this can harm point cloud feature learning, because the image features cannot provide more geometric information than the point cloud. Instead, the authors propose to force both the image and point cloud features to predict objectness probabilities, so that the 3D detector can identify foreground points through feature-level guidance.
A segmentation map is adopted for object supervision so that guidance is only applied in object regions. To avoid extra annotation cost, a self-supervised segmentation method is used to generate foreground maps without annotations. Specifically, within each ground-truth 2D bounding box, the foreground map of the object is extracted with a DINO model. These individual segmentation maps are then merged to form the ground-truth segmentation map $S \in \mathbb{R}^{H\times W}$. Next, a classifier $M_{P'}$ maps the point cloud features to binary classification probabilities, and a focal loss supervises whether each projected point falls in an object region:
$$\mathcal{L}_{\text{seg}}^{\mathcal{P}} = \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \text{FL}(\mathbf{C}_{\mathcal{P}'}(i), \mathbf{S}(i)).$$
On the other hand, for the image domain, a similar objective is used to predict object regions:
$$\mathcal{L}_{\text{seg}}^{\mathcal{I}} = \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \text{FL}(\mathbf{C}_{\mathcal{I}}(i), \mathbf{S}(i)).$$
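Both losses above rely on the same merged foreground map $S$. A minimal sketch of how the per-box masks could be combined into $S$ (assuming a `dino_foreground_mask` helper that returns a binary mask for an image crop; this helper is an assumption, not part of the released code):

```python
import numpy as np

def build_segmentation_map(image, gt_boxes_2d, dino_foreground_mask):
    """Merge per-box foreground masks into a full-image ground-truth map S of shape (H, W).

    gt_boxes_2d: (N, 4) boxes in (x1, y1, x2, y2) pixel coordinates.
    dino_foreground_mask: callable crop -> binary mask of the same crop size
    (e.g., a thresholded DINO attention map); an assumed helper, not the authors' API.
    """
    H, W = image.shape[:2]
    S = np.zeros((H, W), dtype=np.float32)
    for x1, y1, x2, y2 in gt_boxes_2d.astype(int):
        crop = image[y1:y2, x1:x2]
        mask = dino_foreground_mask(crop)                       # (y2-y1, x2-x1) binary mask
        S[y1:y2, x1:x2] = np.maximum(S[y1:y2, x1:x2], mask)     # union of per-box masks
    return S
```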
On top of these two objectness predictions, a KL divergence loss is used to enforce objectness in the point cloud modality and make it learn a distribution similar to the image modality, without losing the geometric information from the point cloud:
$$\mathcal{L}_{\text{kl}} = \text{KL}(\mathbf{C}_{\mathcal{I}} \,\|\, \mathbf{C}_{\mathcal{P}'}).$$
The feature-level guidance loss code is:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.functional import grid_sample


def _sigmoid_cross_entropy_with_logits(logits, labels):
    # Numerically stable sigmoid cross-entropy computed directly from logits.
    loss = torch.clamp(logits, min=0) - logits * labels.type_as(logits)
    loss += torch.log1p(torch.exp(-torch.abs(logits)))
    return loss


class SigmoidFocalClassificationLoss(nn.Module):
    """Sigmoid focal cross entropy loss.

    Focal loss down-weights well classified examples and focuses on the hard
    examples. See https://arxiv.org/pdf/1708.02002.pdf for the loss definition.
    """

    def __init__(self, gamma=2.0, alpha=0.25):
        """Constructor.

        Args:
            gamma: exponent of the modulating factor (1 - p_t) ^ gamma.
            alpha: optional alpha weighting factor to balance positives vs negatives.
        """
        super().__init__()
        self._alpha = alpha
        self._gamma = gamma

    def forward(self, prediction_tensor, target_tensor, weights):
        """Compute loss function.

        Args:
            prediction_tensor: A float tensor of shape [batch_size, num_anchors,
                num_classes] representing the predicted logits for each class.
            target_tensor: A float tensor of shape [batch_size, num_anchors,
                num_classes] representing one-hot encoded classification targets.
            weights: a float tensor of shape [batch_size, num_anchors].
        Returns:
            loss: a float tensor of shape [batch_size, num_anchors, num_classes]
                representing the value of the loss function.
        """
        per_entry_cross_ent = _sigmoid_cross_entropy_with_logits(
            labels=target_tensor, logits=prediction_tensor)
        prediction_probabilities = torch.sigmoid(prediction_tensor)
        p_t = ((target_tensor * prediction_probabilities) +
               ((1 - target_tensor) * (1 - prediction_probabilities)))
        modulating_factor = 1.0
        if self._gamma:
            modulating_factor = torch.pow(1.0 - p_t, self._gamma)
        alpha_weight_factor = 1.0
        if self._alpha is not None:
            alpha_weight_factor = (target_tensor * self._alpha +
                                   (1 - target_tensor) * (1 - self._alpha))
        focal_cross_entropy_loss = (modulating_factor * alpha_weight_factor *
                                    per_entry_cross_ent)
        return focal_cross_entropy_loss * weights


class FeatureLevelLoss(nn.Module):
    def __init__(self):
        super(FeatureLevelLoss, self).__init__()
        self.seg_loss = SigmoidFocalClassificationLoss()

    def forward(self, point_logit, img_logit, l_xy_norm, point_seg_label):
        """
        Args:
            point_logit: (B, N, 1) objectness logits predicted from point features
            img_logit: (B, 1, H, W) objectness logits predicted from image features
            l_xy_norm: (B, N, 2) projected point coordinates normalized to [-1, 1]
            point_seg_label: (B, N, 1) foreground labels from the segmentation map
        """
        B = point_logit.shape[0]
        l_xy_norm = l_xy_norm.unsqueeze(1)  # (B, 1, N, 2)
        # Sample the image objectness map at the projected point locations.
        proj_img_logit = grid_sample(img_logit, l_xy_norm).squeeze(2)  # (B, C, N)
        proj_img_prob = F.softmax(proj_img_logit.view(B, -1), dim=-1)
        point_prob = F.softmax(point_logit.view(B, -1), dim=-1)
        # KL divergence aligns the point-cloud objectness distribution with the image
        # one; F.kl_div expects log-probabilities as its first argument.
        kl_loss = F.kl_div(point_prob.log(), proj_img_prob.detach(), reduction='none')
        pos = (point_seg_label > 0).float()
        # The focal loss takes the raw point logits and the foreground labels from S.
        seg_loss = self.seg_loss(point_logit.view(-1), point_seg_label.view(-1), pos.view(-1))
        return kl_loss.mean() + seg_loss.mean()


if __name__ == "__main__":
    # Sample code for batch_size = 4, number of points = 16384,
    # image height = 384, image width = 1280.
    lidar_feat = torch.rand(4, 16384, 1).cuda()
    img_feat = torch.rand(4, 1, 384, 1280).cuda()
    # Normalized coordinates must lie in [-1, 1]:
    # (x_pixel / (width - 1) * 2 - 1, y_pixel / (height - 1) * 2 - 1)
    l_xy_norm = (torch.rand(4, 16384, 2) * 2 - 1).cuda()
    seg_label = torch.randint(0, 2, (4, 16384, 1)).float().cuda()
    f_loss = FeatureLevelLoss()
    loss = f_loss(lidar_feat, img_feat, l_xy_norm, seg_label)
    print(loss)
```
Output-Level Visual Guidance
It is worth noting that, for any detected object, the 2D box and the projection of its 3D bounding box should exhibit a high degree of overlap, as shown in Figure 4. This means that in the weakly supervised setting where 3D boxes are unknown, the 2D ground-truth boxes can be used to supervise the 3D boxes predicted by the 3D detector.
First, given a predicted 3D bounding box $B_P$, its eight corners in 3D coordinates are obtained, denoted as $C_3(B_P) \in \mathbb{R}^{8\times 3}$. Then, using the known camera calibration parameters, the projected points $C \in \mathbb{R}^{8\times 2}$ in the image are obtained. Next, the bounding box of $C$ can be determined by:
$$(x_a, y_a) = \min(\mathcal{C}), \quad (x_b, y_b) = \max(\mathcal{C}).$$
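The corner-projection helper `corners3d_to_img_boxes_torch` used in the listing further below belongs to the repository's calibration utilities; a minimal sketch of what such a helper might compute (using only the camera matrix `P2`; an assumption, not the released implementation) is:

```python
import torch

def corners3d_to_img_box(corners3d, P2):
    """Project 3D box corners to an axis-aligned 2D box.

    corners3d: (N, 8, 3) corners in rectified camera coordinates.
    P2: (3, 4) camera projection matrix. Returns (N, 4) boxes (x_a, y_a, x_b, y_b).
    """
    N = corners3d.shape[0]
    ones = corners3d.new_ones(N, 8, 1)
    corners_h = torch.cat([corners3d, ones], dim=-1)      # (N, 8, 4) homogeneous corners
    pts = corners_h @ P2.t()                              # (N, 8, 3) projected points
    uv = pts[..., :2] / pts[..., 2:3].clamp(min=1e-6)     # perspective division
    x_a, y_a = uv[..., 0].min(dim=1).values, uv[..., 1].min(dim=1).values
    x_b, y_b = uv[..., 0].max(dim=1).values, uv[..., 1].max(dim=1).values
    return torch.stack([x_a, y_a, x_b, y_b], dim=1)
```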
Therefore, the corresponding 2D box prediction $B_I$ is used to constrain the discrepancy between these two boxes:
$$\mathcal{L}_{\text{box}} = \hat{\sigma}_{\mathcal{I}}\left(1 - \text{GIoU}(\mathbf{B}_{\mathcal{I}}, \mathbf{B}_{\text{proj}})\right)$$
Compared with a plain IoU loss, the GIoU loss better alleviates the vanishing-gradient problem in non-overlapping cases between the projected 3D box and the ground-truth 2D box.
In addition, $\hat{\sigma}_{\mathcal{I}} = \frac{\sigma_{\mathcal{I}}}{\sum_{i}^{N} \sigma_{\mathcal{I}_i}}$ is the normalized score of each predicted 2D box over all $N$ objects in the same scene, and it serves as a per-box weight for the loss. In this way, the proposed output-level guidance enforces precise alignment between the projected bounding boxes and their 2D counterparts.
The output-level guidance loss code is:
```python
from utils.calibration import Calibration
import numpy as np
import torch
import torch.nn as nn


def generalized_iou_loss(gt_bboxes, pr_bboxes, reduction='mean'):
    """GIoU loss as proposed in the GIoU paper.

    gt_bboxes: tensor (-1, 4) in xyxy format
    pr_bboxes: tensor (-1, 4) in xyxy format
    """
    gt_area = (gt_bboxes[:, 2] - gt_bboxes[:, 0]) * (gt_bboxes[:, 3] - gt_bboxes[:, 1])
    pr_area = (pr_bboxes[:, 2] - pr_bboxes[:, 0]) * (pr_bboxes[:, 3] - pr_bboxes[:, 1])
    # IoU
    lt = torch.max(gt_bboxes[:, :2], pr_bboxes[:, :2])
    rb = torch.min(gt_bboxes[:, 2:], pr_bboxes[:, 2:])
    TO_REMOVE = 1
    wh = (rb - lt + TO_REMOVE).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    union = gt_area + pr_area - inter
    iou = inter / union
    # Smallest enclosing box
    lt = torch.min(gt_bboxes[:, :2], pr_bboxes[:, :2])
    rb = torch.max(gt_bboxes[:, 2:], pr_bboxes[:, 2:])
    wh = (rb - lt + TO_REMOVE).clamp(min=0)
    enclosure = wh[:, 0] * wh[:, 1]
    giou = iou - (enclosure - union) / enclosure
    loss = 1. - giou
    if reduction == 'mean':
        loss = loss.mean()
    elif reduction == 'sum':
        loss = loss.sum()
    elif reduction == 'none':
        pass
    return loss


def bbox_overlaps(box1, box2):
    """Intersection over union (IoU) between box1 and box2 in (x1, y1, x2, y2) format.

    Arguments:
        box1 -- tensor of shape (N, 4), first set of boxes
        box2 -- tensor of shape (K, 4), second set of boxes
    Returns:
        ious -- tensor of shape (N, K), IoUs between boxes
    """
    N = box1.size(0)
    K = box2.size(0)
    # torch.max/min broadcast tensors of different shapes.
    xi1 = torch.max(box1[:, 0].view(N, 1), box2[:, 0].view(1, K))
    yi1 = torch.max(box1[:, 1].view(N, 1), box2[:, 1].view(1, K))
    xi2 = torch.min(box1[:, 2].view(N, 1), box2[:, 2].view(1, K))
    yi2 = torch.min(box1[:, 3].view(N, 1), box2[:, 3].view(1, K))
    # Clamp widths/heights at zero using a tensor of the same dtype and device.
    iw = torch.max(xi2 - xi1, box1.new(1).fill_(0))
    ih = torch.max(yi2 - yi1, box1.new(1).fill_(0))
    inter = iw * ih
    box1_area = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    box2_area = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    box1_area = box1_area.view(N, 1)
    box2_area = box2_area.view(1, K)
    union_area = box1_area + box2_area - inter
    ious = inter / union_area
    return ious


def boxes3d_to_corners3d_torch(boxes3d, flip=False):
    """
    :param boxes3d: (N, 7) [x, y, z, h, w, l, ry]
    :return: corners_rotated: (N, 8, 3)
    """
    boxes_num = boxes3d.shape[0]
    h, w, l, ry = boxes3d[:, 3:4], boxes3d[:, 4:5], boxes3d[:, 5:6], boxes3d[:, 6:7]
    if flip:
        ry = ry + np.pi
    centers = boxes3d[:, 0:3]
    zeros = torch.cuda.FloatTensor(boxes_num, 1).fill_(0)
    ones = torch.cuda.FloatTensor(boxes_num, 1).fill_(1)
    x_corners = torch.cat([l / 2., l / 2., -l / 2., -l / 2., l / 2., l / 2., -l / 2., -l / 2.], dim=1)  # (N, 8)
    y_corners = torch.cat([zeros, zeros, zeros, zeros, -h, -h, -h, -h], dim=1)  # (N, 8)
    z_corners = torch.cat([w / 2., -w / 2., -w / 2., w / 2., w / 2., -w / 2., -w / 2., w / 2.], dim=1)  # (N, 8)
    corners = torch.cat((x_corners.unsqueeze(dim=1), y_corners.unsqueeze(dim=1), z_corners.unsqueeze(dim=1)),
                        dim=1)  # (N, 3, 8)
    cosa, sina = torch.cos(ry), torch.sin(ry)
    raw_1 = torch.cat([cosa, zeros, sina], dim=1)
    raw_2 = torch.cat([zeros, ones, zeros], dim=1)
    raw_3 = torch.cat([-sina, zeros, cosa], dim=1)
    R = torch.cat((raw_1.unsqueeze(dim=1), raw_2.unsqueeze(dim=1), raw_3.unsqueeze(dim=1)), dim=1)  # (N, 3, 3)
    corners_rotated = torch.matmul(R, corners)  # (N, 3, 8)
    corners_rotated = corners_rotated + centers.unsqueeze(dim=2).expand(-1, -1, 8)
    corners_rotated = corners_rotated.permute(0, 2, 1)
    return corners_rotated


class OutputLevelLoss(nn.Module):
    def __init__(self):
        super(OutputLevelLoss, self).__init__()
        self.iou_thres = 0.3

    def forward(self, roi_box3d, boxes2d_label, calib, image_size=(384, 1280)):
        img_h, img_w = image_size
        # Project predicted 3D boxes to axis-aligned 2D boxes on the image plane.
        corners3d_torch = boxes3d_to_corners3d_torch(roi_box3d)
        boxes2d_pred, _ = calib.corners3d_to_img_boxes_torch(corners3d_torch)
        boxes2d_pred[:, 0].clamp_(min=0, max=img_w - 1)
        boxes2d_pred[:, 1].clamp_(min=0, max=img_h - 1)
        boxes2d_pred[:, 2].clamp_(min=0, max=img_w - 1)
        boxes2d_pred[:, 3].clamp_(min=0, max=img_h - 1)
        # Match each projected box to the 2D ground-truth box with the highest IoU.
        overlaps = bbox_overlaps(boxes2d_pred, boxes2d_label)
        max_overlap, argmax_overlap = torch.max(overlaps, 1)
        fg_inds = torch.nonzero(max_overlap >= self.iou_thres).view(-1)
        sampled_rois = boxes2d_pred[fg_inds]
        sampled_gts = boxes2d_label[argmax_overlap[fg_inds]]
        if sampled_gts.shape[0] != 0:
            weak_iou_loss = generalized_iou_loss(sampled_rois, sampled_gts)
        else:
            weak_iou_loss = torch.tensor(0.0, device=roi_box3d.device)
        return weak_iou_loss


if __name__ == "__main__":
    img_w = 1280
    img_h = 384
    label_file = "data/label/001264.txt"
    calib_file = "data/calib/001264.txt"
    roi_file = "data/roi/rois.pt"
    calib = Calibration(calib_file)
    rois = torch.load(roi_file)
    with open(label_file, 'r') as f:
        lines = f.readlines()
    box2d_list = []
    for line in lines:
        label = line.strip().split(' ')
        box2d = np.array((float(label[4]), float(label[5]), float(label[6]), float(label[7])), dtype=np.float32)
        if label[0] == 'Car':
            box2d_list.append(box2d)
    boxes2d_label = torch.tensor(box2d_list).cuda()
    out_loss = OutputLevelLoss()
    loss = out_loss(rois, boxes2d_label, calib)
    print(loss)
```
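Note that the listing above reduces the GIoU loss with a plain mean and does not include the normalized-score weight $\hat{\sigma}_{\mathcal{I}}$ from the equation. A minimal sketch of adding that weighting, reusing `generalized_iou_loss` from the listing and assuming `scores_2d` holds the 2D detector's confidence for each matched box (an illustration, not the released code):

```python
import torch

def score_weighted_giou_loss(pred_boxes, gt_boxes, scores_2d):
    """GIoU loss weighted by the normalized 2D confidence of each matched box.

    pred_boxes, gt_boxes: (M, 4) matched box pairs in (x1, y1, x2, y2).
    scores_2d: (M,) confidence scores of the matched 2D predictions.
    """
    per_box = generalized_iou_loss(gt_boxes, pred_boxes, reduction='none')  # (M,)
    weights = scores_2d / scores_2d.sum().clamp(min=1e-6)                   # sigma_hat
    return (weights * per_box).sum()
```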
Training-Level Visual Guidance
A common way to provide direct supervision signals is to use pseudo labels. The initial 3D pseudo labels are first generated in a non-learning manner. However, they tend to be noisy and may miss many objects; for example, with the non-learning method, only about 2,700 of the 3,712 generated training frames contain pseudo labels. In addition, pseudo labels may introduce extra false positives, which can negatively affect self-training. To address these issues, an image-guided approach is introduced to generate high-quality pseudo labels, as shown in Algorithm 1.
The training-level guidance code is:
```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def area(boxes, add1=False):
    """Computes area of boxes.

    Args:
        boxes: Numpy array with shape [N, 4] holding N boxes
    Returns:
        a numpy array with shape [N, 1] representing box areas
    """
    if add1:
        return (boxes[:, 2] - boxes[:, 0] + 1.0) * (boxes[:, 3] - boxes[:, 1] + 1.0)
    else:
        return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])


def intersection(boxes1, boxes2, add1=False):
    """Compute pairwise intersection areas between boxes.

    Args:
        boxes1: a numpy array with shape [N, 4] holding N boxes
        boxes2: a numpy array with shape [M, 4] holding M boxes
    Returns:
        a numpy array with shape [N, M] representing pairwise intersection areas
    """
    [y_min1, x_min1, y_max1, x_max1] = np.split(boxes1, 4, axis=1)
    [y_min2, x_min2, y_max2, x_max2] = np.split(boxes2, 4, axis=1)
    all_pairs_min_ymax = np.minimum(y_max1, np.transpose(y_max2))
    all_pairs_max_ymin = np.maximum(y_min1, np.transpose(y_min2))
    if add1:
        all_pairs_min_ymax += 1.0
    intersect_heights = np.maximum(
        np.zeros(all_pairs_max_ymin.shape),
        all_pairs_min_ymax - all_pairs_max_ymin)
    all_pairs_min_xmax = np.minimum(x_max1, np.transpose(x_max2))
    all_pairs_max_xmin = np.maximum(x_min1, np.transpose(x_min2))
    if add1:
        all_pairs_min_xmax += 1.0
    intersect_widths = np.maximum(
        np.zeros(all_pairs_max_xmin.shape),
        all_pairs_min_xmax - all_pairs_max_xmin)
    return intersect_heights * intersect_widths


def iou(boxes1, boxes2, add1=False):
    """Computes pairwise intersection-over-union between box collections.

    Args:
        boxes1: a numpy array with shape [N, 4] holding N boxes.
        boxes2: a numpy array with shape [M, 4] holding M boxes.
    Returns:
        a numpy array with shape [N, M] representing pairwise IoU scores.
    """
    intersect = intersection(boxes1, boxes2, add1)
    area1 = area(boxes1, add1)
    area2 = area(boxes2, add1)
    union = np.expand_dims(area1, axis=1) + np.expand_dims(area2, axis=0) - intersect
    return intersect / union


if __name__ == "__main__":
    # Load sample data and set the thresholds used for pseudo-label filtering.
    alpha0 = 0.5  # minimum 2D IoU for a valid match between projected and image boxes
    alpha1 = 0.6  # minimum averaged 2D/3D confidence for matched boxes
    alpha2 = 0.8  # minimum 3D confidence for keeping boxes not accepted above
    import pickle
    with open("data/sample_data.pkl", 'rb') as f:
        data = pickle.load(f)
    img_box = data["img_box"]                    # (N, 4) (y_min, x_min, y_max, x_max)
    proj_lidar_box = data["proj_lidar_box"]      # (N, 4) projected 2D box from the 3D box
    img_score = data["img_score"]                # (N, 1)
    proj_lidar_score = data["proj_lidar_score"]  # (N, 1)
    # Hungarian matching between projected LiDAR boxes and image boxes (maximize IoU).
    ious = iou(proj_lidar_box, img_box)
    row_ind, col_ind = linear_sum_assignment(-ious)
    # (row_ind, col_ind) = (matched projected-box index, matched image-box index)
    new_col_ind = []
    for i in range(len(row_ind)):
        if ious[row_ind[i], col_ind[i]] > alpha0:
            new_col_ind.append(col_ind[i])
    index_list = []
    # Keep matched boxes whose averaged 2D/3D score passes the consistency threshold.
    for i in range(len(row_ind)):
        if (proj_lidar_score[row_ind[i]] + img_score[col_ind[i]]) / 2. > alpha1:
            index_list.append(row_ind[i])
    # Boxes unmatched (or not yet kept) are retained only with a high 3D confidence score.
    for i in range(len(proj_lidar_score)):
        if i in index_list:
            continue
        if proj_lidar_score[i] > alpha2:
            index_list.append(i)
    # Final kept indices for the LiDAR pseudo boxes.
    print(index_list)
```
Experiments
The experiments are presented below, starting with results on the KITTI test set. Compared with the fully supervised PointRCNN, VG-W3D achieves comparable performance without using any 3D annotations, demonstrating the effectiveness of the approach. Moreover, compared with the weakly supervised baseline FGR, the method improves $AP_{3D}$ by 3.83/5.81/6.33, respectively, showing the benefit of the proposed visual guidance. In addition, the method outperforms approaches that require 3D or BEV center annotations.
On the KITTI validation set, compared with MTrans, which requires 500 frames of 3D annotations, VG-W3D achieves better results on most metrics, validating the effectiveness of the method. Moreover, compared with VS3D and WS3DPR, which only use category labels to classify proposals, the method guides learning from multiple perspectives and thus yields significantly better performance.
The ablation studies are discussed next. Table 4 shows the effectiveness of visual guidance at different levels. The reproduced FGR baseline achieves only moderate performance, because the initial 3D annotations from the non-learning method are noisy. Feature-level, output-level, and training-level guidance each improve over the baseline.
Table 5 studies the effectiveness of the proposed feature-level guidance (with the feature-level and output-level losses used together). In the first row, a performance drop is observed when the objectness probabilities learned from the image domain are not distilled. In addition, replacing the KL divergence loss with an L2 loss that mimics features rather than objectness predictions yields unsatisfactory results. Finally, using 2D bounding boxes as foreground masks for supervision is ineffective, because they do not provide a reliable signal for the point cloud; for example, some regions inside a box may belong to the background. Learning foreground object regions with the segmentation maps together with the KL divergence benefits the feature-level guidance.
Table 6 compares various options for the output-level guidance. Instead of regressing the IoU between the 2D and 3D bounding boxes, one could use an L1 loss to regress the projected 3D box corners against those predicted by the pretrained image detector. However, due to the noisy nature of the initial 3D box annotations, this approach is less effective, since it limits the image detector's ability to learn corners as guidance. Moreover, the results show that the GIoU loss performs better than the IoU loss, since the projected 3D box usually does not fully overlap with the 2D box.
Table 7 shows that adding the overlap term between the projected 3D boxes and the ground-truth 2D boxes ($B_{\text{overlap}}$) slightly helps training. Finally, integrating the two training-level guidance terms (i.e., $B_{\text{overlap}} + B_{\text{score}}$) keeps predictions with high confidence scores, allowing more objects that lack annotations in the image domain to be identified, which significantly improves the pseudo-labeling process.
Table 8 studies the quality of the initial pseudo labels and of the proposed pseudo-labeling method. The 3D labels generated by the non-learning method often contain noise and missing information, especially when the frustum point cloud lacks a clear separation for accurately localizing the bounding box; as a result, the recall at IoU=0.7 only reaches 46.71%. After the first round of training with the proposed method, the quality of the pseudo labels improves by roughly 25% in recall, demonstrating the effectiveness of the approach. The recall of the pseudo labels then saturates after the second training round.
The method can also be combined with existing pretrained 2D object detectors, as shown in Table 9. Using a DETR detector trained on the COCO dataset, objects on the KITTI dataset can be detected directly to generate 2D boxes. Initially, FGR is used to generate 3D bounding boxes and train the model, establishing the baseline results; the proposed visual guidance is then applied on top.