This post introduces VG-W3D, a weakly supervised 3D object detection model that uses point clouds and images. Weakly supervised 3D object detection aims to learn a 3D detector at low annotation cost, e.g., using only 2D labels. Unlike previous work that still relies on less accurate 3D annotated boxes, this paper proposes a new approach that exploits the constraints between the 2D and 3D domains without requiring any 3D annotations. Specifically, the paper uses visual data to build connections between the 2D and 3D domains from three perspectives:
- First, a feature-level constraint is designed to align LiDAR and image features in object-aware regions.
- Second, an output-level constraint is developed to enforce overlap between the 2D boxes and the projected 3D boxes.
- Finally, a training-level constraint is realized by generating consistent 3D pseudo labels that align with the visual data.
Extensive experiments on the KITTI dataset validate the effectiveness of the three proposed constraints. Without using any 3D annotations, the method achieves favorable performance gains compared with state-of-the-art approaches.
Project link: https://github.com/kuanchihhuang/VG-W3D?tab=readme-ov-file
Introduction
As shown in Figure 1, the authors study the visually guided learning process from three perspectives: object learning at the feature level, response learning at the output level, and pseudo-label learning at the training level.
- Feature-level guidance. For well-calibrated images and LiDAR, the object predictions obtained from the image should align with the corresponding regions in the LiDAR data. For example, when a point is recognized as a foreground object by the 3D detector, its projected pixel on the image plane should agree with the category prediction made by the 2D detector.
- Output-level guidance. A 2D box and the projection of its 3D bounding box onto the image plane overlap substantially. Based on this observation, a 2D-3D constraint is established to guide the supervision of the 3D box candidates.
- Training-level guidance. Due to the sparsity of point clouds, the initial 3D labels obtained from non-learning heuristics can be noisy and may miss objects, so it is essential to iteratively refine these labels to improve their accuracy. Another challenge is reducing false-positive pseudo labels, since during training the model can easily produce unintended estimates with high confidence scores. Therefore, the prediction scores of the 2D boxes from the visual domain are integrated into the pseudo-labeling procedure to ensure that object scores are consistent between the 2D and 3D domains.
Table 1 compares the proposed method with other weakly supervised 3D detection methods. The proposed approach requires no 3D annotations at all, only 2D labels, and combines multiple forms of visual guidance: feature-level, output-level, and training-level.
Proposed Approach
The figure below shows the proposed method. VG-W3D leverages feature-level, output-level, and training-level cues to guide 3D object detection. To obtain initial 3D boxes for training, a non-learning method similar to FGR is adopted to identify the frustum point cloud of each object, and a heuristic algorithm then estimates the initial 3D labels. In the initial stage, a 2D detector is trained with the provided 2D annotated boxes $\hat{B}_I$ to extract visual features $F_I$ and predict 2D bounding boxes $B_I$ with their corresponding confidence scores $\sigma_I$. Subsequently, a PointRCNN detector is used to extract point cloud features $F_P$ and generate 3D bounding boxes $B_P$ together with detection scores $\sigma_P$.
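Purely as a reading aid (not the authors' released code), one training step of the flow described above could be sketched as follows; every identifier here (`detector_2d`, `detector_3d`, the loss modules, and the batch keys) is a hypothetical placeholder:

```python
# A rough sketch of one VG-W3D training step under the description above.
# All names are placeholder assumptions, not the released API.
def train_step(batch, detector_2d, detector_3d, feat_loss, out_loss, optimizer):
    # 2D branch (pretrained on the provided 2D boxes): image features,
    # 2D boxes B_I and confidence scores sigma_I.
    feat_img, boxes_2d, scores_2d = detector_2d(batch["image"])

    # 3D branch (PointRCNN-style): point features, 3D boxes B_P and scores sigma_P.
    feat_pts, boxes_3d, scores_3d = detector_3d(batch["points"])

    # Initial 3D labels come from the non-learning FGR-like frustum heuristic and
    # are later refreshed by the training-level pseudo-labeling step.
    loss = detector_3d.detection_loss(boxes_3d, batch["initial_3d_labels"])
    loss = loss + feat_loss(feat_pts, feat_img, batch["proj_xy"], batch["seg_label"])
    loss = loss + out_loss(boxes_3d, batch["gt_boxes_2d"], batch["calib"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```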
Feature-Level Visual Guidance
The feature-level visual guidance is described below. As shown in Figure 3, let the image features be $F_I \in \mathbb{R}^{H\times W\times C}$ and the point cloud features be $F_P \in \mathbb{R}^{P\times C}$. The point cloud features are first projected onto the image plane using the camera calibration parameters, yielding the projected point features $F_{P'} = \mathrm{Proj}(F_P) \in \mathbb{R}^{H\times W\times C}$.
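For reference, a minimal sketch of the projection step with KITTI-style calibration (not the authors' implementation; `P2`, `R0_rect`, and `Tr_velo_to_cam` are assumed to come from the calibration file):

```python
import numpy as np

def project_lidar_to_image(pts_lidar, P2, R0_rect, Tr_velo_to_cam):
    """Project LiDAR points (N, 3) to pixel coordinates (N, 2) with KITTI calibration.

    P2: (3, 4) camera projection matrix, R0_rect: (3, 3) rectification matrix,
    Tr_velo_to_cam: (3, 4) LiDAR-to-camera transform (all read from the calib file).
    """
    n = pts_lidar.shape[0]
    pts_h = np.hstack([pts_lidar, np.ones((n, 1))])        # (N, 4) homogeneous LiDAR coords
    pts_cam = (R0_rect @ (Tr_velo_to_cam @ pts_h.T)).T     # (N, 3) rectified camera coords
    pts_h = np.hstack([pts_cam, np.ones((n, 1))])          # (N, 4)
    pts_img = (P2 @ pts_h.T).T                             # (N, 3)
    uv = pts_img[:, :2] / pts_img[:, 2:3]                  # perspective division
    return uv, pts_cam[:, 2]                               # pixel coordinates and depth
```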
A straightforward approach would be to force the point cloud features to mimic the image features with an L2 loss:
$$\mathcal{L}_{\text{feat}} = \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \| \mathbf{F}_{\mathcal{I}}(i) - \mathbf{F}_{\mathcal{P}'}(i) \|_2$$
where $\mathcal{A}$ is the set of valid pixel locations in the image. However, this can harm point cloud feature learning, because the image features cannot provide more geometric information than the point cloud. Instead, the authors propose to force both the image and point cloud features to predict objectness probabilities, so that the 3D detector can identify foreground points through feature-level guidance.
A segmentation map is adopted for object supervision so that guidance is only applied in object regions. To avoid extra annotation cost, a self-supervised segmentation method is used to generate foreground maps without annotations. Specifically, within each ground-truth 2D bounding box, the foreground map of the object is extracted with a DINO model. These individual segmentation maps are then merged to form the ground-truth segmentation map $S \in \mathbb{R}^{H\times W}$. Next, a classifier $M_{P'}$ maps the point cloud features to binary classification probabilities, and a focal loss supervises whether each projected point falls in an object region:
$$\mathcal{L}_{\text{seg}}^{\mathcal{P}} = \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \text{FL}(\mathbf{C}_{\mathcal{P}'}(i), \mathbf{S}(i)).$$
On the other hand, for the image domain, a similar objective is used to predict object regions:
$$\mathcal{L}_{\text{seg}}^{\mathcal{I}} = \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \text{FL}(\mathbf{C}_{\mathcal{I}}(i), \mathbf{S}(i)).$$
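Both losses above rely on the same merged foreground map $S$. A minimal sketch of how the per-box masks could be combined into $S$ (assuming a `dino_foreground_mask` helper that returns a binary mask for an image crop; this helper is an assumption, not part of the released code):

```python
import numpy as np

def build_segmentation_map(image, gt_boxes_2d, dino_foreground_mask):
    """Merge per-box foreground masks into a full-image ground-truth map S of shape (H, W).

    gt_boxes_2d: (N, 4) boxes in (x1, y1, x2, y2) pixel coordinates.
    dino_foreground_mask: callable crop -> binary mask of the same crop size
    (e.g., a thresholded DINO attention map); an assumed helper, not the authors' API.
    """
    H, W = image.shape[:2]
    S = np.zeros((H, W), dtype=np.float32)
    for x1, y1, x2, y2 in gt_boxes_2d.astype(int):
        crop = image[y1:y2, x1:x2]
        mask = dino_foreground_mask(crop)                       # (y2-y1, x2-x1) binary mask
        S[y1:y2, x1:x2] = np.maximum(S[y1:y2, x1:x2], mask)     # union of per-box masks
    return S
```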
On top of these two objectness predictions, a KL divergence loss is used to enforce objectness in the point cloud modality and make it learn a distribution similar to the image modality, without losing the geometric information from the point cloud:
$$\mathcal{L}_{\text{kl}} = \text{KL}(\mathbf{C}_{\mathcal{I}} \,\|\, \mathbf{C}_{\mathcal{P}'}).$$
The feature-level guidance loss code is:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.functional import grid_sample


def _sigmoid_cross_entropy_with_logits(logits, labels):
    # Numerically stable sigmoid cross-entropy computed directly from logits.
    loss = torch.clamp(logits, min=0) - logits * labels.type_as(logits)
    loss += torch.log1p(torch.exp(-torch.abs(logits)))
    return loss


class SigmoidFocalClassificationLoss(nn.Module):
    """Sigmoid focal cross entropy loss.

    Focal loss down-weights well classified examples and focuses on the hard
    examples. See https://arxiv.org/pdf/1708.02002.pdf for the loss definition.
    """

    def __init__(self, gamma=2.0, alpha=0.25):
        """Constructor.

        Args:
            gamma: exponent of the modulating factor (1 - p_t) ^ gamma.
            alpha: optional alpha weighting factor to balance positives vs negatives.
        """
        super().__init__()
        self._alpha = alpha
        self._gamma = gamma

    def forward(self, prediction_tensor, target_tensor, weights):
        """Compute loss function.

        Args:
            prediction_tensor: A float tensor of shape [batch_size, num_anchors,
                num_classes] representing the predicted logits for each class.
            target_tensor: A float tensor of shape [batch_size, num_anchors,
                num_classes] representing one-hot encoded classification targets.
            weights: a float tensor of shape [batch_size, num_anchors].
        Returns:
            loss: a float tensor of shape [batch_size, num_anchors, num_classes]
                representing the value of the loss function.
        """
        per_entry_cross_ent = _sigmoid_cross_entropy_with_logits(
            labels=target_tensor, logits=prediction_tensor)
        prediction_probabilities = torch.sigmoid(prediction_tensor)
        p_t = ((target_tensor * prediction_probabilities) +
               ((1 - target_tensor) * (1 - prediction_probabilities)))
        modulating_factor = 1.0
        if self._gamma:
            modulating_factor = torch.pow(1.0 - p_t, self._gamma)
        alpha_weight_factor = 1.0
        if self._alpha is not None:
            alpha_weight_factor = (target_tensor * self._alpha +
                                   (1 - target_tensor) * (1 - self._alpha))
        focal_cross_entropy_loss = (modulating_factor * alpha_weight_factor *
                                    per_entry_cross_ent)
        return focal_cross_entropy_loss * weights


class FeatureLevelLoss(nn.Module):
    def __init__(self):
        super(FeatureLevelLoss, self).__init__()
        self.seg_loss = SigmoidFocalClassificationLoss()

    def forward(self, point_logit, img_logit, l_xy_norm, point_seg_label):
        """
        Args:
            point_logit: (B, N, 1) objectness logits predicted from point features
            img_logit: (B, 1, H, W) objectness logits predicted from image features
            l_xy_norm: (B, N, 2) projected point coordinates normalized to [-1, 1]
            point_seg_label: (B, N, 1) foreground labels from the segmentation map
        """
        B = point_logit.shape[0]
        l_xy_norm = l_xy_norm.unsqueeze(1)  # (B, 1, N, 2)
        # Sample the image objectness map at the projected point locations.
        proj_img_logit = grid_sample(img_logit, l_xy_norm).squeeze(2)  # (B, C, N)
        proj_img_prob = F.softmax(proj_img_logit.view(B, -1), dim=-1)
        point_prob = F.softmax(point_logit.view(B, -1), dim=-1)
        # KL divergence aligns the point-cloud objectness distribution with the image
        # one; F.kl_div expects log-probabilities as its first argument.
        kl_loss = F.kl_div(point_prob.log(), proj_img_prob.detach(), reduction='none')
        pos = (point_seg_label > 0).float()
        # The focal loss takes the raw point logits and the foreground labels from S.
        seg_loss = self.seg_loss(point_logit.view(-1), point_seg_label.view(-1), pos.view(-1))
        return kl_loss.mean() + seg_loss.mean()


if __name__ == "__main__":
    # Sample code for batch_size = 4, number of points = 16384,
    # image height = 384, image width = 1280.
    lidar_feat = torch.rand(4, 16384, 1).cuda()
    img_feat = torch.rand(4, 1, 384, 1280).cuda()
    # Normalized coordinates must lie in [-1, 1]:
    # (x_pixel / (width - 1) * 2 - 1, y_pixel / (height - 1) * 2 - 1)
    l_xy_norm = (torch.rand(4, 16384, 2) * 2 - 1).cuda()
    seg_label = torch.randint(0, 2, (4, 16384, 1)).float().cuda()
    f_loss = FeatureLevelLoss()
    loss = f_loss(lidar_feat, img_feat, l_xy_norm, seg_label)
    print(loss)
```
Output-Level Visual Guidance
It is worth noting that, for any detected object, the 2D box and the projection of its 3D bounding box should exhibit a high degree of overlap, as shown in Figure 4. This means that in the weakly supervised setting where 3D boxes are unknown, the 2D ground-truth boxes can be used to supervise the 3D boxes predicted by the 3D detector.
First, given a predicted 3D bounding box $B_P$, its eight corners in 3D coordinates are obtained, denoted as $C_3(B_P) \in \mathbb{R}^{8\times 3}$. Then, using the known camera calibration parameters, the projected points $C \in \mathbb{R}^{8\times 2}$ in the image are obtained. Next, the bounding box of $C$ can be determined by:
$$(x_a, y_a) = \min(\mathcal{C}), \quad (x_b, y_b) = \max(\mathcal{C}).$$
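The corner-projection helper `corners3d_to_img_boxes_torch` used in the listing further below belongs to the repository's calibration utilities; a minimal sketch of what such a helper might compute (using only the camera matrix `P2`; an assumption, not the released implementation) is:

```python
import torch

def corners3d_to_img_box(corners3d, P2):
    """Project 3D box corners to an axis-aligned 2D box.

    corners3d: (N, 8, 3) corners in rectified camera coordinates.
    P2: (3, 4) camera projection matrix. Returns (N, 4) boxes (x_a, y_a, x_b, y_b).
    """
    N = corners3d.shape[0]
    ones = corners3d.new_ones(N, 8, 1)
    corners_h = torch.cat([corners3d, ones], dim=-1)      # (N, 8, 4) homogeneous corners
    pts = corners_h @ P2.t()                              # (N, 8, 3) projected points
    uv = pts[..., :2] / pts[..., 2:3].clamp(min=1e-6)     # perspective division
    x_a, y_a = uv[..., 0].min(dim=1).values, uv[..., 1].min(dim=1).values
    x_b, y_b = uv[..., 0].max(dim=1).values, uv[..., 1].max(dim=1).values
    return torch.stack([x_a, y_a, x_b, y_b], dim=1)
```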
Therefore, the corresponding 2D box prediction $B_I$ is used to constrain the discrepancy between these two boxes:
$$\mathcal{L}_{\text{box}} = \hat{\sigma}_{\mathcal{I}}\left(1 - \text{GIoU}(\mathbf{B}_{\mathcal{I}}, \mathbf{B}_{\text{proj}})\right)$$
Compared with a plain IoU loss, the GIoU loss better alleviates the vanishing-gradient problem in non-overlapping cases between the projected 3D box and the ground-truth 2D box.
In addition, $\hat{\sigma}_{\mathcal{I}} = \frac{\sigma_{\mathcal{I}}}{\sum_{i}^{N} \sigma_{\mathcal{I}_i}}$ is the normalized score of each predicted 2D box over all $N$ objects in the same scene, and it serves as a per-box weight for the loss. In this way, the proposed output-level guidance enforces precise alignment between the projected bounding boxes and their 2D counterparts.
The output-level guidance loss code is:
```python
from utils.calibration import Calibration
import numpy as np
import torch
import torch.nn as nn


def generalized_iou_loss(gt_bboxes, pr_bboxes, reduction='mean'):
    """GIoU loss as proposed in the GIoU paper.

    gt_bboxes: tensor (-1, 4) in xyxy format
    pr_bboxes: tensor (-1, 4) in xyxy format
    """
    gt_area = (gt_bboxes[:, 2] - gt_bboxes[:, 0]) * (gt_bboxes[:, 3] - gt_bboxes[:, 1])
    pr_area = (pr_bboxes[:, 2] - pr_bboxes[:, 0]) * (pr_bboxes[:, 3] - pr_bboxes[:, 1])
    # IoU
    lt = torch.max(gt_bboxes[:, :2], pr_bboxes[:, :2])
    rb = torch.min(gt_bboxes[:, 2:], pr_bboxes[:, 2:])
    TO_REMOVE = 1
    wh = (rb - lt + TO_REMOVE).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    union = gt_area + pr_area - inter
    iou = inter / union
    # Smallest enclosing box
    lt = torch.min(gt_bboxes[:, :2], pr_bboxes[:, :2])
    rb = torch.max(gt_bboxes[:, 2:], pr_bboxes[:, 2:])
    wh = (rb - lt + TO_REMOVE).clamp(min=0)
    enclosure = wh[:, 0] * wh[:, 1]
    giou = iou - (enclosure - union) / enclosure
    loss = 1. - giou
    if reduction == 'mean':
        loss = loss.mean()
    elif reduction == 'sum':
        loss = loss.sum()
    elif reduction == 'none':
        pass
    return loss


def bbox_overlaps(box1, box2):
    """Intersection over union (IoU) between box1 and box2 in (x1, y1, x2, y2) format.

    Arguments:
        box1 -- tensor of shape (N, 4), first set of boxes
        box2 -- tensor of shape (K, 4), second set of boxes
    Returns:
        ious -- tensor of shape (N, K), IoUs between boxes
    """
    N = box1.size(0)
    K = box2.size(0)
    # torch.max/min broadcast tensors of different shapes.
    xi1 = torch.max(box1[:, 0].view(N, 1), box2[:, 0].view(1, K))
    yi1 = torch.max(box1[:, 1].view(N, 1), box2[:, 1].view(1, K))
    xi2 = torch.min(box1[:, 2].view(N, 1), box2[:, 2].view(1, K))
    yi2 = torch.min(box1[:, 3].view(N, 1), box2[:, 3].view(1, K))
    # Clamp widths/heights at zero using a tensor of the same dtype and device.
    iw = torch.max(xi2 - xi1, box1.new(1).fill_(0))
    ih = torch.max(yi2 - yi1, box1.new(1).fill_(0))
    inter = iw * ih
    box1_area = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    box2_area = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    box1_area = box1_area.view(N, 1)
    box2_area = box2_area.view(1, K)
    union_area = box1_area + box2_area - inter
    ious = inter / union_area
    return ious


def boxes3d_to_corners3d_torch(boxes3d, flip=False):
    """
    :param boxes3d: (N, 7) [x, y, z, h, w, l, ry]
    :return: corners_rotated: (N, 8, 3)
    """
    boxes_num = boxes3d.shape[0]
    h, w, l, ry = boxes3d[:, 3:4], boxes3d[:, 4:5], boxes3d[:, 5:6], boxes3d[:, 6:7]
    if flip:
        ry = ry + np.pi
    centers = boxes3d[:, 0:3]
    zeros = torch.cuda.FloatTensor(boxes_num, 1).fill_(0)
    ones = torch.cuda.FloatTensor(boxes_num, 1).fill_(1)
    x_corners = torch.cat([l / 2., l / 2., -l / 2., -l / 2., l / 2., l / 2., -l / 2., -l / 2.], dim=1)  # (N, 8)
    y_corners = torch.cat([zeros, zeros, zeros, zeros, -h, -h, -h, -h], dim=1)  # (N, 8)
    z_corners = torch.cat([w / 2., -w / 2., -w / 2., w / 2., w / 2., -w / 2., -w / 2., w / 2.], dim=1)  # (N, 8)
    corners = torch.cat((x_corners.unsqueeze(dim=1), y_corners.unsqueeze(dim=1), z_corners.unsqueeze(dim=1)),
                        dim=1)  # (N, 3, 8)
    cosa, sina = torch.cos(ry), torch.sin(ry)
    raw_1 = torch.cat([cosa, zeros, sina], dim=1)
    raw_2 = torch.cat([zeros, ones, zeros], dim=1)
    raw_3 = torch.cat([-sina, zeros, cosa], dim=1)
    R = torch.cat((raw_1.unsqueeze(dim=1), raw_2.unsqueeze(dim=1), raw_3.unsqueeze(dim=1)), dim=1)  # (N, 3, 3)
    corners_rotated = torch.matmul(R, corners)  # (N, 3, 8)
    corners_rotated = corners_rotated + centers.unsqueeze(dim=2).expand(-1, -1, 8)
    corners_rotated = corners_rotated.permute(0, 2, 1)
    return corners_rotated


class OutputLevelLoss(nn.Module):
    def __init__(self):
        super(OutputLevelLoss, self).__init__()
        self.iou_thres = 0.3

    def forward(self, roi_box3d, boxes2d_label, calib, image_size=(384, 1280)):
        img_h, img_w = image_size
        # Project predicted 3D boxes to axis-aligned 2D boxes on the image plane.
        corners3d_torch = boxes3d_to_corners3d_torch(roi_box3d)
        boxes2d_pred, _ = calib.corners3d_to_img_boxes_torch(corners3d_torch)
        boxes2d_pred[:, 0].clamp_(min=0, max=img_w - 1)
        boxes2d_pred[:, 1].clamp_(min=0, max=img_h - 1)
        boxes2d_pred[:, 2].clamp_(min=0, max=img_w - 1)
        boxes2d_pred[:, 3].clamp_(min=0, max=img_h - 1)
        # Match each projected box to the 2D ground-truth box with the highest IoU.
        overlaps = bbox_overlaps(boxes2d_pred, boxes2d_label)
        max_overlap, argmax_overlap = torch.max(overlaps, 1)
        fg_inds = torch.nonzero(max_overlap >= self.iou_thres).view(-1)
        sampled_rois = boxes2d_pred[fg_inds]
        sampled_gts = boxes2d_label[argmax_overlap[fg_inds]]
        if sampled_gts.shape[0] != 0:
            weak_iou_loss = generalized_iou_loss(sampled_rois, sampled_gts)
        else:
            weak_iou_loss = torch.tensor(0.0, device=roi_box3d.device)
        return weak_iou_loss


if __name__ == "__main__":
    img_w = 1280
    img_h = 384
    label_file = "data/label/001264.txt"
    calib_file = "data/calib/001264.txt"
    roi_file = "data/roi/rois.pt"
    calib = Calibration(calib_file)
    rois = torch.load(roi_file)
    with open(label_file, 'r') as f:
        lines = f.readlines()
    box2d_list = []
    for line in lines:
        label = line.strip().split(' ')
        box2d = np.array((float(label[4]), float(label[5]), float(label[6]), float(label[7])), dtype=np.float32)
        if label[0] == 'Car':
            box2d_list.append(box2d)
    boxes2d_label = torch.tensor(box2d_list).cuda()
    out_loss = OutputLevelLoss()
    loss = out_loss(rois, boxes2d_label, calib)
    print(loss)
```
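Note that the listing above reduces the GIoU loss with a plain mean and does not include the normalized-score weight $\hat{\sigma}_{\mathcal{I}}$ from the equation. A minimal sketch of adding that weighting, reusing `generalized_iou_loss` from the listing and assuming `scores_2d` holds the 2D detector's confidence for each matched box (an illustration, not the released code):

```python
import torch

def score_weighted_giou_loss(pred_boxes, gt_boxes, scores_2d):
    """GIoU loss weighted by the normalized 2D confidence of each matched box.

    pred_boxes, gt_boxes: (M, 4) matched box pairs in (x1, y1, x2, y2).
    scores_2d: (M,) confidence scores of the matched 2D predictions.
    """
    per_box = generalized_iou_loss(gt_boxes, pred_boxes, reduction='none')  # (M,)
    weights = scores_2d / scores_2d.sum().clamp(min=1e-6)                   # sigma_hat
    return (weights * per_box).sum()
```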
Training-Level Visual Guidance
A common way to provide direct supervision signals is to use pseudo labels. The initial 3D pseudo labels are first generated in a non-learning manner. However, they tend to be noisy and may miss many objects; for example, with the non-learning method, only about 2,700 of the 3,712 generated training frames contain pseudo labels. In addition, pseudo labels may introduce extra false positives, which can negatively affect self-training. To address these issues, an image-guided approach is introduced to generate high-quality pseudo labels, as shown in Algorithm 1.
The training-level guidance code is:
```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def area(boxes, add1=False):
    """Computes area of boxes.

    Args:
        boxes: Numpy array with shape [N, 4] holding N boxes
    Returns:
        a numpy array with shape [N, 1] representing box areas
    """
    if add1:
        return (boxes[:, 2] - boxes[:, 0] + 1.0) * (boxes[:, 3] - boxes[:, 1] + 1.0)
    else:
        return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])


def intersection(boxes1, boxes2, add1=False):
    """Compute pairwise intersection areas between boxes.

    Args:
        boxes1: a numpy array with shape [N, 4] holding N boxes
        boxes2: a numpy array with shape [M, 4] holding M boxes
    Returns:
        a numpy array with shape [N, M] representing pairwise intersection areas
    """
    [y_min1, x_min1, y_max1, x_max1] = np.split(boxes1, 4, axis=1)
    [y_min2, x_min2, y_max2, x_max2] = np.split(boxes2, 4, axis=1)
    all_pairs_min_ymax = np.minimum(y_max1, np.transpose(y_max2))
    all_pairs_max_ymin = np.maximum(y_min1, np.transpose(y_min2))
    if add1:
        all_pairs_min_ymax += 1.0
    intersect_heights = np.maximum(
        np.zeros(all_pairs_max_ymin.shape),
        all_pairs_min_ymax - all_pairs_max_ymin)
    all_pairs_min_xmax = np.minimum(x_max1, np.transpose(x_max2))
    all_pairs_max_xmin = np.maximum(x_min1, np.transpose(x_min2))
    if add1:
        all_pairs_min_xmax += 1.0
    intersect_widths = np.maximum(
        np.zeros(all_pairs_max_xmin.shape),
        all_pairs_min_xmax - all_pairs_max_xmin)
    return intersect_heights * intersect_widths


def iou(boxes1, boxes2, add1=False):
    """Computes pairwise intersection-over-union between box collections.

    Args:
        boxes1: a numpy array with shape [N, 4] holding N boxes.
        boxes2: a numpy array with shape [M, 4] holding M boxes.
    Returns:
        a numpy array with shape [N, M] representing pairwise IoU scores.
    """
    intersect = intersection(boxes1, boxes2, add1)
    area1 = area(boxes1, add1)
    area2 = area(boxes2, add1)
    union = np.expand_dims(area1, axis=1) + np.expand_dims(area2, axis=0) - intersect
    return intersect / union


if __name__ == "__main__":
    # Load sample data and set the thresholds used for pseudo-label filtering.
    alpha0 = 0.5  # minimum 2D IoU for a valid match between projected and image boxes
    alpha1 = 0.6  # minimum averaged 2D/3D confidence for matched boxes
    alpha2 = 0.8  # minimum 3D confidence for keeping boxes not accepted above
    import pickle
    with open("data/sample_data.pkl", 'rb') as f:
        data = pickle.load(f)
    img_box = data["img_box"]                    # (N, 4) (y_min, x_min, y_max, x_max)
    proj_lidar_box = data["proj_lidar_box"]      # (N, 4) projected 2D box from the 3D box
    img_score = data["img_score"]                # (N, 1)
    proj_lidar_score = data["proj_lidar_score"]  # (N, 1)
    # Hungarian matching between projected LiDAR boxes and image boxes (maximize IoU).
    ious = iou(proj_lidar_box, img_box)
    row_ind, col_ind = linear_sum_assignment(-ious)
    # (row_ind, col_ind) = (matched projected-box index, matched image-box index)
    new_col_ind = []
    for i in range(len(row_ind)):
        if ious[row_ind[i], col_ind[i]] > alpha0:
            new_col_ind.append(col_ind[i])
    index_list = []
    # Keep matched boxes whose averaged 2D/3D score passes the consistency threshold.
    for i in range(len(row_ind)):
        if (proj_lidar_score[row_ind[i]] + img_score[col_ind[i]]) / 2. > alpha1:
            index_list.append(row_ind[i])
    # Boxes unmatched (or not yet kept) are retained only with a high 3D confidence score.
    for i in range(len(proj_lidar_score)):
        if i in index_list:
            continue
        if proj_lidar_score[i] > alpha2:
            index_list.append(i)
    # Final kept indices for the LiDAR pseudo boxes.
    print(index_list)
```
Experiments
The experiments are presented below, starting with results on the KITTI test set. Compared with the fully supervised PointRCNN, VG-W3D achieves comparable performance without using any 3D annotations, demonstrating the effectiveness of the approach. Moreover, compared with the weakly supervised baseline FGR, the method improves $AP_{3D}$ by 3.83/5.81/6.33, respectively, showing the benefit of the proposed visual guidance. In addition, the method outperforms approaches that require 3D or BEV center annotations.
On the KITTI validation set, compared with MTrans, which requires 500 frames of 3D annotations, VG-W3D achieves better results on most metrics, validating the effectiveness of the method. Moreover, compared with VS3D and WS3DPR, which only use category labels to classify proposals, the method guides learning from multiple perspectives and thus yields significantly better performance.
The ablation studies are discussed next. Table 4 shows the effectiveness of visual guidance at different levels. The reproduced FGR baseline achieves only moderate performance, because the initial 3D annotations from the non-learning method are noisy. Feature-level, output-level, and training-level guidance each improve over the baseline.
Table 5 studies the effectiveness of the proposed feature-level guidance (with the feature-level and output-level losses used together). In the first row, a performance drop is observed when the objectness probabilities learned from the image domain are not distilled. In addition, replacing the KL divergence loss with an L2 loss that mimics features rather than objectness predictions yields unsatisfactory results. Finally, using 2D bounding boxes as foreground masks for supervision is ineffective, because they do not provide a reliable signal for the point cloud; for example, some regions inside a box may belong to the background. Learning foreground object regions with the segmentation maps together with the KL divergence benefits the feature-level guidance.
Table 6 compares various options for the output-level guidance. Instead of regressing the IoU between the 2D and 3D bounding boxes, one could use an L1 loss to regress the projected 3D box corners against those predicted by the pretrained image detector. However, due to the noisy nature of the initial 3D box annotations, this approach is less effective, since it limits the image detector's ability to learn corners as guidance. Moreover, the results show that the GIoU loss performs better than the IoU loss, since the projected 3D box usually does not fully overlap with the 2D box.
Table 7 shows that adding the overlap term between the projected 3D boxes and the ground-truth 2D boxes ($B_{\text{overlap}}$) slightly helps training. Finally, integrating the two training-level guidance terms (i.e., $B_{\text{overlap}} + B_{\text{score}}$) keeps predictions with high confidence scores, allowing more objects that lack annotations in the image domain to be identified, which significantly improves the pseudo-labeling process.
Table 8 studies the quality of the initial pseudo labels and of the proposed pseudo-labeling method. The 3D labels generated by the non-learning method often contain noise and missing information, especially when the frustum point cloud lacks a clear separation for accurately localizing the bounding box; as a result, the recall at IoU=0.7 only reaches 46.71%. After the first round of training with the proposed method, the quality of the pseudo labels improves by roughly 25% in recall, demonstrating the effectiveness of the approach. The recall of the pseudo labels then saturates after the second training round.
The method can also be combined with existing pretrained 2D object detectors, as shown in Table 9. Using a DETR detector trained on the COCO dataset, objects on the KITTI dataset can be detected directly to generate 2D boxes. Initially, FGR is used to generate 3D bounding boxes and train the model, establishing the baseline results; the proposed visual guidance is then applied on top.