Frustum PointNets for 3D Object Detection from RGB-D Data: Original Paper + Reading Notes

Abstract

In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefited from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability.

Limitations of previous methods: conventional 3D object detection (from images or 3D voxels) tends to obscure the natural 3D patterns and the invariances of 3D data.
Main approach of this paper: operate directly on raw point clouds.
Key difficulty of this approach: how to efficiently localize 3D objects within the point cloud of a large-scale scene (region proposal).
Strengths of the method: efficient object localization with high recall even for small objects; accurate 3D bounding box estimation even under strong occlusion or with very sparse points; good real-time performance.

1. Introduction

Recently, great progress has been made on 2D image understanding tasks, such as object detection [10] and instance segmentation [11]. However, beyond getting 2D bounding boxes or pixel masks, 3D understanding is eagerly in demand in many applications such as autonomous driving and augmented reality (AR). With the popularity of 3D sensors deployed on mobile devices and autonomous vehicles, more and more 3D data is captured and processed. In this work, we study one of the most important 3D perception tasks – 3D object detection, which classifies the object category and estimates oriented 3D bounding boxes of physical objects from 3D sensor data.

Background: progress on 2D image understanding has been rapid, while 3D understanding is urgently needed in many applications such as autonomous driving and AR.
Premise: 3D sensors provide large amounts of 3D data available for processing.

While 3D sensor data is often in the form of point clouds, how to represent point cloud and what deep net architectures to use for 3D object detection remains an open problem. Most existing works convert 3D point clouds to images by projection [30, 21] or to volumetric grids by quantization [33, 18, 21] and then apply convolutional networks. This data representation transformation, however, may
obscure natural 3D patterns and invariances of the data. Recently, a number of papers have proposed to process point clouds directly without converting them to other formats. For example, [20, 22] proposed new types of deep net architectures, called PointNets, which have shown superior performance and efficiency in several 3D understanding tasks such as object classification and semantic segmentation.
While PointNets are capable of classifying a whole point cloud or predicting a semantic class for each point in a point cloud, it is unclear how this architecture can be used for instance-level 3D object detection. Towards this goal, we have to address one key challenge: how to efficiently propose possible locations of 3D objects in a 3D space. Imitating the practice in image detection, it is straightforward to enumerate candidate 3D boxes by sliding windows [7] or by 3D region proposal networks such as [27]. However, the computational complexity of 3D search typically grows cubically with respect to resolution and becomes too expensive for large scenes or real-time applications such as autonomous driving.
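
As a back-of-the-envelope note on this cubic growth: halving the step size of a 3D sliding-window search multiplies the number of candidate positions by eight. A tiny sketch (the scene extent and step sizes below are made-up values, not from the paper):

```python
# Illustration only: how the number of 3D sliding-window positions grows
# with resolution (hypothetical scene extent and step sizes).
def num_candidate_positions(scene_extent_m, step_m):
    cells_per_axis = int(scene_extent_m / step_m)
    # Candidates grow with the cube of the per-axis resolution.
    return cells_per_axis ** 3

for step in (0.4, 0.2, 0.1):  # halving the step each time
    print(step, num_candidate_positions(scene_extent_m=80.0, step_m=step))
# Each halving of the step multiplies the candidate count by 2**3 = 8.
```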
Instead, in this work, we reduce the search space following the dimension reduction principle: we take the advantage of mature 2D object detectors (Fig. 1). First, we extract the 3D bounding frustum of an object by extruding 2D bounding boxes from image detectors. Then, within the 3D space trimmed by each of the 3D frustums, we consecutively perform 3D object instance segmentation and amodal 3D bounding box regression using two variants of PointNet. The segmentation network predicts the 3D mask of the object of interest (i.e. instance segmentation); and the regression network estimates the amodal 3D bounding box (covering the entire object even if only part of it is visible).

How previous related work processes 3D point clouds:
projection to images, or quantization into volumetric grids -> apply CNNs (see the voxelization sketch below)
Problem: the natural 3D patterns and the invariances of 3D data are not well preserved.
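
A minimal sketch of what that quantization step can look like, turning a raw point cloud into a binary occupancy grid that a 3D CNN can consume (the grid bounds and resolution are assumed values, for illustration only):

```python
import numpy as np

def voxelize(points, grid_min, grid_max, resolution=32):
    """Quantize an (N, 3) point cloud into a binary occupancy grid.

    grid_min / grid_max bound the region of interest; resolution is the
    number of cells per axis (an assumed value, for illustration only).
    """
    grid = np.zeros((resolution,) * 3, dtype=np.float32)
    scale = resolution / (np.asarray(grid_max, float) - np.asarray(grid_min, float))
    idx = np.floor((points - np.asarray(grid_min, float)) * scale).astype(int)
    keep = np.all((idx >= 0) & (idx < resolution), axis=1)  # drop out-of-bounds points
    idx = idx[keep]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # mark occupied cells
    return grid

# Example: 1000 random points inside a 4 m cube around the origin.
pts = np.random.uniform(-2.0, 2.0, size=(1000, 3))
print(voxelize(pts, grid_min=(-2, -2, -2), grid_max=(2, 2, 2)).shape)  # (32, 32, 32)
```

This quantization is exactly what the note above refers to: fine geometric detail inside each cell is lost, which is part of why the paper prefers to operate on raw points.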

A method that processes 3D point clouds directly: PointNet
Open question in this direction: it was unclear whether the architecture could be applied to instance-level 3D object detection.

To address the high computational complexity of conventional 3D search,
the paper reduces the search space as follows:
1) use a mature 2D object detector to extract a frustum-shaped 3D point cloud region for each detected object;
2) apply two PointNet variants for segmentation and box estimation: the segmentation network performs instance segmentation, and the regression network estimates the amodal 3D bounding box.
Since the projection matrix is known, the 3D frustum can be obtained directly from the 2D image region (see the sketch below).
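
A minimal sketch of that frustum extraction step, assuming the points are already expressed in the camera frame and a 3x4 projection matrix `P` is available from calibration (the function and argument names are illustrative, not from the authors' released code):

```python
import numpy as np

def points_in_frustum(points_xyz, P, box2d):
    """Keep the points whose image projection falls inside a 2D detection box.

    points_xyz: (N, 3) points in the camera frame.
    P:          (3, 4) camera projection matrix from calibration.
    box2d:      (xmin, ymin, xmax, ymax) box from the 2D detector.
    """
    homo = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # homogeneous coords
    uvw = homo @ P.T                                                   # project into the image plane
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]                # pixel coordinates
    xmin, ymin, xmax, ymax = box2d
    in_box = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    in_front = uvw[:, 2] > 0                                           # discard points behind the camera
    return points_xyz[in_box & in_front]
```

The points returned this way form the frustum point cloud that the segmentation and box-regression networks then consume.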

In contrast to previous work that treats RGB-D data as 2D maps for CNNs, our method is more 3D-centric as we lift depth maps to 3D point clouds and process them using 3D tools. This 3D-centric view enables new capabilities for exploring 3D data in a more effective manner. First, in our pipeline, a few transformations are applied successively on 3D coordinates, which align point clouds into a sequence of more constrained and canonical frames. These alignments factor out pose variations in data, and thus make 3D geometry pattern more evident, leading to an easier job of 3D learners. Second, learning in 3D space can better exploit the geometric and topological structure of 3D space. In principle, all objects live in 3D space; therefore, we believe that many geometric structures, such as repetition, planarity, and symmetry, are more naturally parameterized and captured by learners that directly operate in 3D space. The usefulness of this 3D-centric network design philosophy has been supported by much recent experimental evidence.
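
To make the "sequence of more constrained and canonical frames" concrete, here is a minimal sketch of two such normalizations: rotating each frustum so its center ray points forward, then re-centering the points at their centroid. The axis conventions and rotation sign are assumptions for illustration, not taken from the authors' code:

```python
import numpy as np

def canonicalize_frustum(points, frustum_angle):
    """Normalize a frustum point cloud: rotate about the camera's up (y) axis
    so the frustum's center ray aligns with +z, then subtract the centroid.

    frustum_angle: heading of the frustum center ray in the x-z plane (radians).
    """
    c, s = np.cos(-frustum_angle), np.sin(-frustum_angle)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    aligned = points @ rot_y.T             # pose-normalized coordinates
    centroid = aligned.mean(axis=0)
    # Return the centroid too, so the shift can be undone when reporting boxes.
    return aligned - centroid, centroid
```

After such normalizations, objects seen from very different viewpoints end up in similar poses, which is the "factoring out pose variations" the paragraph above describes.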
Our method achieves leading positions on KITTI 3D object detection [1] and bird’s eye view detection [2] benchmarks. Compared with the previous state of the art [5], our method is 8.04% better on 3D car AP with high efficiency (running at 5 fps).

### Recommended Papers on Multi-Modal Sensor Fusion and Object Detection

Below are several classic papers in multi-modal sensor fusion and object detection, with brief notes on their research context.

#### 1. Frustum PointNets for 3D Object Detection from RGB-D Data

This paper proposes a 3D object detection method that combines point clouds with image information. 2D detection boxes in the image are lifted into 3D space to form a "frustum", which greatly reduces the amount of point cloud that has to be processed [^4]. The method performs well in autonomous driving scenarios and effectively fuses LiDAR and camera data.

```python
# Skeleton of the basic Frustum PointNet pipeline (the helper functions are
# placeholders to be filled in with a concrete implementation)
def frustum_pointnet(image, point_cloud):
    # Lift the 2D detection box into a 3D frustum
    frustum = generate_frustum(image)
    # Keep only the points that fall inside the frustum
    points_in_frustum = filter_points(point_cloud, frustum)
    # Extract point cloud features from the frustum points
    features = extract_features(points_in_frustum)
    return features

def generate_frustum(image):
    # 2D-to-3D projection logic (requires the camera calibration)
    pass

def filter_points(point_cloud, frustum):
    # Select the points lying inside the frustum
    pass

def extract_features(points):
    # Point cloud feature extraction (e.g. a PointNet encoder)
    pass
```

#### 2. MV3D: Multi-View 3D Network for Autonomous Driving Environment Perception

MV3D is a multi-view 3D object detection framework that builds bird's-eye-view and front-view features from LiDAR and camera data [^5]. Fusing multiple views improves detection accuracy, especially in complex traffic scenes.

#### 3. Fusion4D: A Unified Framework for 3D Object Detection with Multi-Sensor Fusion

Fusion4D proposes a unified multi-sensor fusion framework for 3D object detection, integrating LiDAR, millimeter-wave radar, and camera data with a deep learning model [^6].

#### 4. PointRCNN: 3D Object Proposal Network for Object Detection from Point Cloud Only

PointRCNN is a 3D object detection method that relies on point cloud data alone and achieves high accuracy with a two-stage network design [^7]. Although it does not involve multi-modal fusion, its in-depth treatment of point cloud processing is an important reference for later fusion work.

#### 5. SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Detection from a LiDAR Bird’s Eye View

SqueezeSegV2 improves the model structure and introduces unsupervised domain adaptation for detecting road objects from a LiDAR bird's-eye view [^8]. The work shows how to improve a model's generalization across different scenes.
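
MV3D and SqueezeSegV2 above both rely on a bird's-eye-view (BEV) rasterization of the LiDAR point cloud. Here is a minimal sketch of one common variant, a per-cell maximum-height map (the ranges and cell size are assumed values, not taken from the cited papers):

```python
import numpy as np

def bev_height_map(points_xyz, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), cell=0.1):
    """Rasterize a LiDAR point cloud into a bird's-eye-view max-height map."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z = x[keep], y[keep], z[keep]
    rows = ((y - y_range[0]) / cell).astype(int)
    cols = ((x - x_range[0]) / cell).astype(int)
    h = int((y_range[1] - y_range[0]) / cell)
    w = int((x_range[1] - x_range[0]) / cell)
    bev = np.full((h, w), -np.inf, dtype=np.float32)
    np.maximum.at(bev, (rows, cols), z)   # keep the highest point per cell
    bev[np.isneginf(bev)] = 0.0           # empty cells get a default height of 0
    return bev
```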