Literature Review: Apple and Baidu and Deep Neural Networks for Point Clouds








Recently, Apple made what they must have known would be a big splash by silently publishing a research paper with results from a deep neural network that two of their researchers built.

The network and the paper in question were clearly designed for autonomous driving, which Apple has been working on, more or less in secret, for years.

The network in question — VoxelNet — has been trained to perform object detection on lidar point clouds. This isn’t a huge leap from object detection on images, which has been a topic of deep learning research for several years, but it is a new frontier in deep learning for autonomous vehicles. Kudos to Apple for publishing their results.

VoxelNet (by Apple) draws heavily on two previous efforts at applying deep learning to lidar point clouds, both by Baidu-affiliated researchers. Since the three papers kind of work as a trio, I did a quick scan of them together.

3D Fully Convolutional Network for Vehicle Detection in Point Cloud

Bo Li (Baidu)

Bo Li basically applies the DenseBox fully convolutional network (FCN) architecture to a three-dimensional point cloud.

To do this, Li:

  • Divides the point cloud into voxels, so instead of running 2D pixels through a network, we’re running 3D voxels (see the sketch after this list).
  • Trains an FCN to identify features in the voxelized point cloud.
  • Upsamples the FCN’s output to produce two tensors: an objectness tensor and a bounding box tensor.

The bounding box tensor is probably the more interesting one for perception purposes: it draws a bounding box around each car on the road. Q.E.D.
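
To make that concrete, here is a minimal sketch of the two steps I find most interesting: binning raw lidar points into a voxel occupancy grid, and pushing that grid through a tiny 3D FCN with an objectness head and a bounding box head. The grid ranges, layer sizes, and the 24-channel corner encoding are illustrative guesses on my part, not the paper’s actual architecture.

```python
import numpy as np
import torch
import torch.nn as nn

def voxelize(points, mins=(0.0, -40.0, -3.0), maxs=(70.0, 40.0, 1.0), dims=(176, 200, 10)):
    """Bin an (N, 3) lidar point cloud into a binary occupancy grid.
    The ranges and resolution here are made-up illustrative values."""
    mins, maxs, dims = map(np.asarray, (mins, maxs, dims))
    keep = np.all((points >= mins) & (points < maxs), axis=1)
    idx = ((points[keep] - mins) / (maxs - mins) * dims).astype(int)
    grid = np.zeros(dims, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # mark voxels that contain at least one point
    return grid

class Tiny3DFCN(nn.Module):
    """Toy stand-in for the paper's network: a downsampling trunk, an upsampling
    (deconvolution) stage, and two heads for the objectness and bounding box tensors."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.up = nn.ConvTranspose3d(32, 32, kernel_size=4, stride=4)
        self.objectness = nn.Conv3d(32, 1, kernel_size=1)  # "is part of a vehicle here?"
        self.bbox = nn.Conv3d(32, 24, kernel_size=1)       # e.g. 8 box corners x 3 offsets

    def forward(self, voxels):  # voxels: (batch, 1, D, H, W) occupancy grid
        features = self.up(self.trunk(voxels))
        return self.objectness(features), self.bbox(features)
```

The point of the voxel step is just that, once the cloud sits on a regular grid, the machinery of image-style fully convolutional detection carries over to 3D almost unchanged.
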
Multi-View 3D Object Detection Network for Autonomous Driving

Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia (Tsinghua and Baidu)

A team of Tsinghua and Baidu researchers developed Multi-View 3D (MV3D) networks, which combine lidar and camera images in a complex neural network pipeline.

In contrast to Li’s solo work, which constructs voxels out of the lidar point cloud, MV3D simply takes two separate 2D views of the point cloud: one from the front and one from the top (bird’s eye). MV3D also uses the 2D camera image associated with each lidar scan.

That provides three separate 2D images (lidar front view, lidar top view, camera front view).
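
For the lidar views in particular, “taking a 2D view” really just means rendering the point cloud as an image. Here is a minimal sketch of one common way to build the top view, where each pixel stores the height of the tallest point in its ground cell; MV3D itself packs several channels per cell (height slices, density, intensity), and the ranges and cell size below are my own illustrative choices.

```python
import numpy as np

def birds_eye_height_map(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), cell=0.1):
    """Render an (N, 3) lidar point cloud as a top-down 2D image whose pixel value
    is the maximum point height in each ground cell. Parameters are illustrative."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z = x[keep], y[keep], z[keep]

    rows = ((x - x_range[0]) / cell).astype(int)
    cols = ((y - y_range[0]) / cell).astype(int)
    height = int((x_range[1] - x_range[0]) / cell)
    width = int((y_range[1] - y_range[0]) / cell)

    image = np.full((height, width), -np.inf, dtype=np.float32)
    np.maximum.at(image, (rows, cols), z.astype(np.float32))  # keep the tallest point per cell
    image[np.isneginf(image)] = 0.0  # cells with no points get an arbitrary neutral value
    return image
```

The front view is built in the same spirit, just by binning points over azimuth and elevation instead of over the ground plane.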

MV3D uses each view to create a bounding box in two dimensions. The bird’s-eye lidar view creates a bounding box parallel to the ground, whereas the front lidar view and the camera view each create a 2D bounding box perpendicular to the ground. Combining these 2D bounding boxes yields a 3D bounding box to draw around the vehicle.
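
Said another way, the ground-plane rectangle fixes the box’s footprint and the perpendicular views fix its vertical extent; together that pins down all eight corners. A toy illustration of that geometry (MV3D actually gets there with 3D anchor boxes and regression rather than literally intersecting 2D boxes):

```python
def box_3d_from_views(bev_box, z_min, z_max):
    """bev_box: (x_min, y_min, x_max, y_max) footprint from the bird's-eye view,
    parallel to the ground. z_min/z_max: vertical extent taken from a view
    perpendicular to the ground. Returns the 8 corners of the resulting 3D box."""
    x_min, y_min, x_max, y_max = bev_box
    return [(x, y, z)
            for z in (z_min, z_max)
            for x in (x_min, x_max)
            for y in (y_min, y_max)]
```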

At the end of the network, MV3D employs something called “deep fusion” to combine output from each of the three neural network pipelines (one associated with each view). I’ll be honest — I don’t really understand how “deep fusion” works, so leave me a note in the comments if you can follow what they’re doing.
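
My best guess, for what it’s worth, is that each fusion round averages the three branches’ intermediate features element-wise and then lets every branch keep transforming the shared result, so the views trade information several times instead of being joined once at the end. A hedged sketch of that reading, definitely not code from the paper; corrections welcome:

```python
import torch.nn as nn

class DeepFusionBlock(nn.Module):
    """One round of fusion as I read it: element-wise mean across the three views,
    followed by a per-branch transform. Stack several of these for 'deep' fusion."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(channels, channels), nn.ReLU()) for _ in range(3)]
        )

    def forward(self, bev_feat, front_feat, cam_feat):
        fused = (bev_feat + front_feat + cam_feat) / 3.0  # element-wise mean fusion
        return tuple(branch(fused) for branch in self.branches)
```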

The results are a classification of the object and a bounding box around it.

VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

Yin Zhou, Oncel Tuzel (Apple)

That brings us to VoxelNet, from Apple, which got so much press recently.

VoxelNet has three components, in order:

  • Feature Learning Network
  • Convolutional Middle Layers
  • Region Proposal Network

The Feature Learning Network seems to be the main “contribution to knowledge”, as the scholars say.

What this network seems to do is start with a semi-random sample of points from within “interesting” (my word, not theirs) voxels. That sample of points gets run through a fully connected (not fully convolutional) network, which learns point-wise features relevant to the voxel the points came from.

The network, in fact, uses these point-wise features to develop voxel-wise features that describe each of the “interesting” voxels. I’m oversimplifying wildly, but think of this as learning features that describe each voxel and are relevant to classifying the part of the vehicle that is in that voxel. So a voxel might have features like “black”, “rubber”, and “treads”, and so you could guess that the voxel captures part of a tire. Of course, the real features won’t necessarily be intelligible by humans, but that’s the idea.
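
Here is roughly how I picture that step, as a minimal sketch: a shared fully connected layer computes a feature for every sampled point, and a max over the points in each voxel collapses them into one voxel-wise feature vector. (The paper’s actual VFE layers also concatenate the pooled feature back onto each point and repeat; the input and output sizes below are placeholders.)

```python
import torch.nn as nn

class VoxelFeatureEncoder(nn.Module):
    """Simplified take on VoxelNet's feature learning idea: shared per-point
    features, then a per-voxel max to get one descriptor per voxel."""
    def __init__(self, point_dim=7, feature_dim=128):
        super().__init__()
        # Fully connected (not fully convolutional) transform shared by all points.
        self.pointwise = nn.Sequential(nn.Linear(point_dim, feature_dim), nn.ReLU())

    def forward(self, voxel_points):
        # voxel_points: (num_voxels, points_per_voxel, point_dim), i.e. the
        # semi-random sample of points drawn from each "interesting" voxel.
        point_features = self.pointwise(voxel_points)
        voxel_features, _ = point_features.max(dim=1)  # aggregate points into one vector per voxel
        return voxel_features                          # (num_voxels, feature_dim)
```

The max is what makes the whole thing insensitive to the order of the sampled points within a voxel, which matters because point clouds are unordered.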

These voxel-wise features can then get pumped through the Convolutional Middle Layers and finally through the Region Proposal Network and, voila, out come bounding boxes and classifications.
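
Putting the three pieces in a row, the data flow looks something like the schematic below. voxelize_and_sample and scatter_to_grid are hypothetical helper names I made up for the bookkeeping that groups raw points by voxel and places the learned voxel features back into a dense grid for the convolutions.

```python
def voxelnet_forward(raw_points, feature_net, middle_layers, rpn,
                     voxelize_and_sample, scatter_to_grid):
    """Schematic of the end-to-end flow, not the paper's code."""
    voxel_points, voxel_coords = voxelize_and_sample(raw_points)  # group points by voxel
    voxel_features = feature_net(voxel_points)                    # Feature Learning Network
    dense_grid = scatter_to_grid(voxel_features, voxel_coords)    # back to a dense volume
    spatial_features = middle_layers(dense_grid)                  # Convolutional Middle Layers
    boxes, scores = rpn(spatial_features)                         # Region Proposal Network
    return boxes, scores
```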


One of the most impressive parts of this line of research is just how new it is. The two Baidu papers were both first published online a year ago, and only made it into conferences in the last six months. The Apple paper only just appeared online in the last couple of weeks.

It’s an exciting time to be building deep neural networks for autonomous vehicles.

Original article: https://medium.com/self-driving-cars/literature-review-apple-and-baidu-and-deep-neural-networks-for-point-clouds-f4f4104fa901
