PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
Preface:
I read through several blog posts on PV-RCNN, but some parts of the network structure were still not clear to me, so I went back and carefully read the original paper, which explains everything thoroughly; it clicked right away. This post therefore works through the original paper, which also helps reinforce my own memory.
论文地址:https://arxiv.org/pdf/1912.13192.pdf
- Box-related terms that appear in this post (see the sketch after this list):
- ground truth: the human-annotated box;
- anchor: a hand-specified prior box, typically placed in the RPN stage of a two-stage detector or set directly in a one-stage detector;
- proposal: the output box of the RPN in a two-stage method, i.e., the result of the first regression applied to the anchors;
- RoI: not a single bounding box; it refers to the proposals output by the RPN after score ranking, top-k selection, and NMS down to a fixed number of boxes, which are then refined again in the second stage;
- bounding box: the final predicted box obtained after the proposal has been refined.
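To make the relationship between these terms concrete, here is a minimal numpy sketch of a generic two-stage detection pipeline. All names and the box parameterization are illustrative placeholders (this is not the OpenPCDet API), and the NMS is a toy center-distance version rather than the rotated-IoU NMS used in practice.

```python
import numpy as np

def apply_deltas(boxes, deltas):
    # Regress boxes by predicted offsets; boxes/deltas are (N, 7) arrays of
    # (x, y, z, dx, dy, dz, heading). A toy parameterization for illustration.
    out = boxes.copy()
    out[:, :3] += deltas[:, :3] * boxes[:, 3:6]   # center offsets, size-normalized
    out[:, 3:6] *= np.exp(deltas[:, 3:6])         # log-scale size residuals
    out[:, 6] += deltas[:, 6]                     # heading residual
    return out

def toy_nms(boxes, scores, keep=128, dist_thresh=2.0):
    # Toy NMS: greedily keep high-scored boxes whose centers are not too
    # close to an already-kept box (real detectors use rotated-IoU NMS).
    kept = []
    for i in np.argsort(-scores):
        if all(np.linalg.norm(boxes[i, :3] - boxes[j, :3]) > dist_thresh for j in kept):
            kept.append(i)
        if len(kept) == keep:
            break
    return boxes[kept]

def two_stage_pipeline(anchors, rpn_deltas, rpn_scores, top_k=512):
    # anchors -> proposals: the first regression, done by the RPN.
    proposals = apply_deltas(anchors, rpn_deltas)
    # proposals -> RoIs: rank by score, take top-k, then NMS.
    order = np.argsort(-rpn_scores)[:top_k]
    rois = toy_nms(proposals[order], rpn_scores[order])
    # RoIs -> bounding boxes: the second stage would regress each RoI once
    # more (omitted here) to produce the final predicted boxes.
    return rois
```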
Abstract
We present a novel and high-performance 3D object detection framework, named PointVoxel-RCNN (PV-RCNN), for accurate 3D object detection from point clouds. Our proposed method deeply integrates both 3D voxel Convolutional Neural Network (CNN) and PointNet-based set abstraction to learn more discriminative point cloud features. It takes advantage of the efficient learning and high-quality proposals of the 3D voxel CNN and the flexible receptive fields of the PointNet-based networks. Specifically, the proposed framework summarizes the 3D scene with a 3D voxel CNN into a small set of keypoints via a novel voxel set abstraction module to save follow-up computations and also to encode representative scene features. Given the high-quality 3D proposals generated by the voxel CNN, RoI-grid pooling is proposed to abstract proposal-specific features from the keypoints to the RoI-grid points via keypoint set abstraction with multiple receptive fields. Compared with conventional pooling operations, the RoI-grid feature points encode much richer context information for accurately estimating object confidences and locations. Extensive experiments on both the KITTI dataset and the Waymo Open dataset show that our proposed PV-RCNN surpasses state-of-the-art 3D detection methods by remarkable margins while using only point clouds.
Code is available at https://github.com/open-mmlab/OpenPCDet
1. Introduction
3D object detection has been receiving increasing attention from both industry and academia thanks to its wide applications in various fields such as autonomous driving and robotics. LiDAR sensors are widely adopted in autonomous driving vehicles and robots for capturing 3D scene information as sparse and irregular point clouds, which provide vital cues for 3D scene perception and understanding. In this paper, we propose to achieve high performance 3D object detection by designing novel point-voxel integrated networks to learn better 3D features from irregular point clouds.
Most existing 3D detection methods could be classified into two categories in terms of point cloud representations, i.e., the grid-based methods and the point-based methods. The grid-based methods generally transform the irregular point clouds to regular representations such as 3D voxels [27, 41, 34, 2, 26] or 2D bird-view maps [1, 11, 36, 17, 35, 12, 16], which could be efficiently processed by 3D or 2D Convolutional Neural Networks (CNN) to learn point features for 3D detection. Powered by the pioneering work PointNet and its variants [23, 24], the point-based methods [22, 25, 32, 37] directly extract discriminative features from raw point clouds for 3D detection. Generally, the grid-based methods are more computationally efficient but the inevitable information loss degrades the fine-grained localization accuracy, while the point-based methods have higher computation cost but could easily achieve larger receptive fields by the point set abstraction [24]. However, we show that a unified framework could integrate the best of the two types of methods, and surpass the prior state-of-the-art 3D detection methods with remarkable margins.
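As a concrete illustration of the grid-based representation (and of where its information loss comes from), here is a minimal numpy sketch that buckets an irregular point cloud into regular voxels. The voxel size and per-voxel point capacity are illustrative values, not the paper's configuration.

```python
import numpy as np

def voxelize(points, voxel_size=(0.05, 0.05, 0.1), max_points_per_voxel=5):
    """points: (N, 4) array of (x, y, z, intensity)."""
    coords = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    voxels = {}
    for coord, point in zip(map(tuple, coords), points):
        bucket = voxels.setdefault(coord, [])
        if len(bucket) < max_points_per_voxel:   # information loss happens here:
            bucket.append(point)                 # extra points are simply dropped
    return voxels  # dict: integer voxel coordinate -> list of contained points

points = np.random.rand(1000, 4) * [70.0, 40.0, 4.0, 1.0]
voxels = voxelize(points)
print(f"{len(points)} points -> {len(voxels)} non-empty (sparse) voxels")
```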
We propose a novel 3D object detection framework, PV-RCNN (Illustrated in Fig. 1), which boosts the 3D detection performance by incorporating the advantages from both the Point-based and Voxel-based feature learning methods. The principle of PV-RCNN lies in the fact that the voxel-based operation efficiently encodes multi-scale feature representations and can generate high-quality 3D proposals, while the PointNet-based set abstraction operation preserves accurate location information with flexible receptive fields. We argue that the integration of these two types of feature learning frameworks can help learn more discriminative features for accurate fine-grained box refinement.
The main challenge would be how to effectively combine the two types of feature learning schemes, specifically the 3D voxel CNN with sparse convolutions [6, 5] and the PointNet-based set abstraction [24], into a unified framework. An intuitive solution would be uniformly sampling several grid points within each 3D proposal and adopting set abstraction to aggregate the 3D voxel-wise features surrounding these grid points for proposal refinement. However, this strategy is highly memory-intensive, since both the number of voxels and the number of grid points could be quite large for achieving satisfactory performance.
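To get a rough sense of the scale involved (illustrative numbers, not figures from the paper): with 100 proposals and a 6x6x6 grid per proposal, there are already 21,600 grid points, and each of them would have to query its neighbors among the tens of thousands of non-empty voxels that a typical LiDAR scene produces, in every frame.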
Therefore, to better integrate these two types of point cloud feature learning networks, we propose a two-step strategy with the first voxel-to-keypoint scene encoding step and the second keypoint-to-grid RoI feature abstraction step. Specifically, a voxel CNN with 3D sparse convolution is adopted for voxel-wise feature learning and accurate proposal generation. To mitigate the above-mentioned issue of requiring too many voxels for encoding the whole scene, a small set of keypoints is selected by farthest point sampling (FPS) to summarize the overall 3D information from the voxel-wise features. The features of each keypoint are aggregated by grouping the neighboring voxel-wise features via PointNet-based set abstraction for summarizing multi-scale point cloud information. In this way, the overall scene can be effectively and efficiently encoded by a small number of keypoints with associated multi-scale features.
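The keypoint selection step can be made concrete with a minimal numpy implementation of farthest point sampling; real implementations (e.g., the CUDA version used in OpenPCDet) run on GPU, so this sketch shows only the algorithm.

```python
import numpy as np

def farthest_point_sampling(points, n_keypoints):
    """points: (N, 3); returns indices of n_keypoints well-spread points."""
    n = points.shape[0]
    selected = np.zeros(n_keypoints, dtype=np.int64)
    min_dist = np.full(n, np.inf)          # squared dist to nearest selected point
    selected[0] = np.random.randint(n)     # arbitrary starting point
    for i in range(1, n_keypoints):
        # update distances against the most recently selected keypoint
        diff = points - points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum('ij,ij->i', diff, diff))
        # greedily pick the point farthest from all selected keypoints so far
        selected[i] = np.argmax(min_dist)
    return selected

scene = np.random.rand(16384, 3) * [70.0, 80.0, 4.0]
keypoint_idx = farthest_point_sampling(scene, n_keypoints=2048)
```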
For the second keypoint-to-grid RoI feature abstraction step, given each box proposal with its grid point locations, a RoI-grid pooling module is proposed, where a keypoint set abstraction layer with multiple radii is adopted for each grid point to aggregate the features from the keypoints with multi-scale context. All grid points' aggregated features can then be jointly used for the succeeding proposal refinement. Our proposed PV-RCNN effectively takes advantage of both point-based and voxel-based networks to encode discriminative features at each box proposal for accurate confidence prediction and fine-grained box refinement.
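Here is a minimal sketch of the grouping idea behind RoI-grid pooling: each grid point gathers keypoint features within several radii (a ball query per radius) and pools each group. In PV-RCNN the grouped features additionally pass through PointNet-style MLPs before max-pooling; the radii and sizes below are illustrative, not the paper's configuration.

```python
import numpy as np

def roi_grid_pool(grid_points, keypoints, keypoint_feats, radii=(0.8, 1.6)):
    """grid_points: (G, 3); keypoints: (K, 3); keypoint_feats: (K, C)."""
    pooled = []
    for r in radii:                                   # multiple receptive fields
        feats_r = np.zeros((len(grid_points), keypoint_feats.shape[1]))
        for g, gp in enumerate(grid_points):
            mask = np.linalg.norm(keypoints - gp, axis=1) < r   # ball query
            if mask.any():
                feats_r[g] = keypoint_feats[mask].max(axis=0)   # max-pool group
        pooled.append(feats_r)
    # (G, C * len(radii)): concatenated multi-scale features for refinement
    return np.concatenate(pooled, axis=1)

grid = np.random.rand(216, 3)                # e.g. 6x6x6 grid points in one RoI
kp, kp_feats = np.random.rand(2048, 3), np.random.rand(2048, 128)
roi_feature = roi_grid_pool(grid, kp, kp_feats)
```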
Our contributions can be summarized into four aspects. (1) We propose the PV-RCNN framework, which effectively takes advantage of both the voxel-based and point-based methods for 3D point-cloud feature learning, leading to improved performance of 3D object detection with manageable memory consumption. (2) We propose the voxel-to-keypoint scene encoding scheme, which encodes multi-scale voxel features of the whole scene into a small set of keypoints by the voxel set abstraction layer. These keypoint features not only preserve accurate locations but also encode rich scene context, which boosts the 3D detection performance significantly. (3) We propose a multi-scale RoI feature abstraction layer for the grid points in each proposal, which aggregates richer context information from the scene with multiple receptive fields for accurate box refinement and confidence prediction. (4) Our proposed method PV-RCNN outperforms all previous methods with remarkable margins and ranks 1st on the highly competitive KITTI 3D detection benchmark [10], and also surpasses previous methods on the large-scale Waymo Open dataset by a large margin.
2. Related Work
3D Object Detection with Grid-based Methods. To tackle the irregular data format of point clouds, most existing works project the point clouds onto regular grids to be processed by 2D or 3D CNNs. The pioneering work MV3D [1] projects the point clouds to 2D bird-view grids and places lots of predefined 3D anchors for generating 3D bounding boxes, and the following works [11, 17, 16] develop better strategies for multi-sensor fusion, while [36, 35, 12] propose more efficient frameworks with the bird-view representation. Some other works [27, 41] divide the point clouds into 3D voxels to be processed by a 3D CNN, and 3D sparse convolution [5] is introduced in [34] for efficient 3D voxel processing. [30, 42] utilize multiple detection heads while [26] explores the object part locations for improving the performance. These grid-based methods are generally efficient for accurate 3D proposal generation, but their receptive fields are constrained by the kernel size of the 2D/3D convolutions.
3D Object Detection with Point-based Methods. F-PointNet [22] first proposes to apply PointNet [23, 24] for 3D detection from the cropped point clouds based on the 2D image bounding boxes.