YOLO3D: Paper Notes on End-to-End 3D Object Detection

License: CC 4.0 BY-SA

YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud

Link to the paper

This paper applies YOLO to 3D object detection and reaches 40 fps on the KITTI dataset with a Titan X GPU.

The main contributions of the paper are:

1- Extending YOLO V2[3] to include orientation of the OBB as a direct regression task.

2- Extending YOLO V2[3] to include the height and 3D OBB center coordinates (x,y,z) as a direct regression task.

3- Real-time performance evaluation and experimentation with Titan X GPU, on the challenging KITTI benchmark, with recommendations of the best grid-map resolution, and operating IoU threshold that balances speed and accuracy.

Point Cloud Representation

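The point cloud is first projected onto a 2D bird's-eye-view grid map. Two maps are created. In the first, the value of each cell (pixel) is the maximum height of the points that fall into it. In the second, the value of each cell is the point density: the more points in a cell, the larger the value. The density is computed the same way as in the MV3D paper, where N is the number of points in the cell:

$\min\left(1.0,\frac{\log(N+1)}{\log(64)}\right)$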
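
The projection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `bev_maps`, the range and cell-size parameters, and the choice to fill empty cells with a height of 0 are all assumptions made for the example.

```python
import math

def bev_maps(points, x_range, y_range, cell_size):
    """Project (x, y, z) points onto a bird's-eye-view grid.

    Returns two 2D lists: the maximum height per cell and the
    normalized point density per cell.
    """
    n_rows = int(round((x_range[1] - x_range[0]) / cell_size))
    n_cols = int(round((y_range[1] - y_range[0]) / cell_size))
    max_height = [[None] * n_cols for _ in range(n_rows)]
    counts = [[0] * n_cols for _ in range(n_rows)]

    for x, y, z in points:
        # Skip points outside the grid extent.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Clamp the index to guard against floating-point edge cases.
        r = min(int((x - x_range[0]) / cell_size), n_rows - 1)
        c = min(int((y - y_range[0]) / cell_size), n_cols - 1)
        counts[r][c] += 1
        if max_height[r][c] is None or z > max_height[r][c]:
            max_height[r][c] = z

    # Assumption: empty cells get height 0.0.
    height = [[0.0 if h is None else h for h in row] for row in max_height]
    # Density: min(1.0, log(N + 1) / log(64)), with N the points in the cell.
    density = [[min(1.0, math.log(n + 1) / math.log(64)) for n in row]
               for row in counts]
    return height, density
```

The two returned maps correspond to the two input channels of the network. A real pipeline would most likely vectorize this loop; the plain-Python version is only meant to make the per-cell rule explicit.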

Yaw Angle Regression

The yaw angle of the predicted box ranges from -π to π. It is normalized to the range -1 to 1, and the loss is computed as a squared error:

$\sum_{i=0}^{s^{2}}\sum_{j=0}^{B}L_{ij}^{obj}\left(\phi_{i}-\hat{\phi}_{i}\right)^{2}$
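
As a rough illustration of the yaw term, the sketch below normalizes angles by π and sums the squared error over the cells and boxes selected by the object mask. The function names and the data layout (one target yaw per cell, one predicted yaw per box in each cell) are assumptions for this example, not details taken from the paper.

```python
import math

def normalize_yaw(yaw):
    """Map a yaw angle in [-pi, pi] to [-1, 1]."""
    return yaw / math.pi

def yaw_loss(pred, target, obj_mask):
    """Sum of squared yaw errors where obj_mask[i][j] is set.

    pred[i][j]: normalized yaw predicted by box j of cell i (assumed layout)
    target[i]:  normalized ground-truth yaw for cell i
    obj_mask[i][j]: 1 if box j of cell i is responsible for an object
    """
    total = 0.0
    for i in range(len(target)):            # grid cells
        for j in range(len(obj_mask[i])):   # boxes per cell
            if obj_mask[i][j]:
                total += (target[i] - pred[i][j]) ** 2
    return total
```

Boxes with a zero mask contribute nothing, so only the boxes assigned to a ground-truth object are penalized for their yaw error.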

3D Bounding Box Regression

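This part is the same as in YOLO v2, only extended to three dimensions. The one thing to note is that the z coordinate is mapped to a single grid cell only, rather than across all grid cells as x and y are, because object heights differ very little and the variability along z is small.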

$b_{x}=\sigma (t_{x})+c_{x}$

$b_{y}=\sigma (t_{y})+c_{y}$

$b_{z}=\sigma (t_{z})+c_{z}$

$b_{w}=p_{w}e^{t_{w}}$

$b_{l}=p_{l}e^{t_{l}}$

$b_{h}=p_{h}e^{t_{h}}$
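
The six equations above translate directly into a small decoding function. The sketch below is illustrative only: `decode_box` and its argument layout are assumed names, with `cell` holding the grid offsets (c_x, c_y, c_z) and `prior` holding the anchor dimensions (p_w, p_l, p_h).

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(t, cell, prior):
    """Turn raw network outputs into a 3D box in grid units.

    t:     (t_x, t_y, t_z, t_w, t_l, t_h) raw outputs
    cell:  (c_x, c_y, c_z) offsets of the responsible grid cell
    prior: (p_w, p_l, p_h) anchor dimensions
    """
    tx, ty, tz, tw, tl, th = t
    cx, cy, cz = cell
    pw, pl, ph = prior
    return (
        sigmoid(tx) + cx,    # b_x: center stays inside the cell
        sigmoid(ty) + cy,    # b_y
        sigmoid(tz) + cz,    # b_z
        pw * math.exp(tw),   # b_w: anchor scaled by exp of the output
        pl * math.exp(tl),   # b_l
        ph * math.exp(th),   # b_h
    )
```

With all raw outputs at zero, the decoded box sits at the center of its cell and has exactly the anchor dimensions, which is why a good prior makes the regression easier.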

Anchors Calculation

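YOLO v2 uses k-means clustering to obtain anchors of many different sizes. These priors cover the full range of box sizes that can occur in the data, so boxes of different sizes can be used to detect objects of different sizes. Cars, however, are relatively uniform in size, so this paper does not use k-means clustering to generate priors of different sizes. Instead, it takes the mean dimensions of the 3D boxes as the prior box size.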
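
Computing such a prior amounts to averaging the ground-truth box dimensions. The following is a minimal sketch under the assumption that each box is given as a (width, length, height) tuple; the function name `mean_anchor` is hypothetical.

```python
def mean_anchor(boxes):
    """Return the mean (w, l, h) over a list of ground-truth 3D boxes."""
    if not boxes:
        raise ValueError("at least one box is required")
    n = len(boxes)
    return tuple(sum(box[k] for box in boxes) / n for k in range(3))
```

The result can be passed as the prior dimensions (p_w, p_l, p_h) when decoding the network outputs.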

Combined Loss for 3D OBB

The overall loss adds terms for the extra dimensions; everything else is handled the same way as in YOLO v2.

Network Architecture and Hyper Parameters

Changes relative to the YOLO v2 network architecture:

We modified one max-pooling layer to change the down-sampling from 32 to 16 so we can have a larger grid at the end; this has a contribution in detecting small objects like pedestrians and cyclists.

We removed the skip connection from the model as we found it resulting in less accurate results.

We added terms in the loss function for yaw, z center coordinate, and height regressions to facilitate the 3D oriented bounding box detection.

Our input consists of 2 channels, one representing the maximum height, and the other one representing the density of points in the point cloud, computed as shown in Eq. (1).
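
To see why halving the down-sampling factor gives a larger output grid, the short sketch below computes the grid size for a given input resolution. The helper `output_grid` and the 608 x 608 input size used in the usage note are assumptions chosen for illustration; the notes do not state the input resolution.

```python
def output_grid(input_h, input_w, stride):
    """Output grid size for a network with the given total down-sampling."""
    return input_h // stride, input_w // stride
```

For a hypothetical 608 x 608 input, `output_grid(608, 608, 32)` gives a 19 x 19 grid, while `output_grid(608, 608, 16)` gives 38 x 38. Each cell then covers a quarter of the area, which helps separate small objects such as pedestrians and cyclists.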