3D (Input) Sparse Convolution

This post reviews 2D weight sparsity in deep learning and contrasts it with the memory-efficiency problems caused by the high dimensionality and irregular structure of 3D point clouds. It focuses on the Submanifold Sparse Convolutional Networks (SSCNs) approach, which optimizes 3D sparse convolution by reducing memory consumption and structuring the sparsity, covering both the SC and SSC operations. It also touches on Minkowski CNN and the Minkowski Engine for processing high-dimensional sparse data, and on the TorchSparse library for accelerating point-cloud workloads.

Review:

2D Weight sparsity in DNNs

(Weight) Sparsity in Deep Learning — EverNoob's CSDN blog

==> the above-mentioned 2D sparsity is decidedly different from the 3D input-sparsity scenario: there we manually created structured sparse weights to cut down the memory footprint, whereas here the sparsity is inherent, caused by the unstructured nature of point clouds or simply by high dimensionality;

====> the long and short of the unstructured sparse point cloud is: if we keep using dense operations, which general-purpose hardware is poised to perform, we have to waste a LOT of time and energy moving and computing zeros.
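To see how much of a dense grid is actually zeros, here is a small sketch that voxelizes a toy point cloud and measures occupancy; the scene size, point count, and voxel resolution are invented for illustration, not taken from any benchmark:

```python
import numpy as np

# A toy LiDAR-like point cloud: 10k points scattered in a 100 m cube,
# voxelized at 0.5 m resolution (all numbers are illustrative).
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 100.0, size=(10_000, 3))

voxel_size = 0.5
grid_shape = (200, 200, 200)
coords = np.clip(np.floor(points / voxel_size).astype(np.int64), 0, 199)

dense = np.zeros(grid_shape, dtype=np.float32)
dense[coords[:, 0], coords[:, 1], coords[:, 2]] = 1.0  # mark occupied voxels

occupancy = dense.sum() / dense.size
print(f"occupied voxels: {occupancy:.6%}")  # well under 1% of the grid
```

Even with generous point counts, occupancy stays far below 1%, so a dense 3D convolution spends almost all of its memory traffic and FLOPs on zeros.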

==> just recall how we want data to be structured for parallel hardware, and how we exploited structured sparsity for performance gains;

==> the main goal or method for 3D sparse convolution acceleration is hence to:

reduce memory footprint;

structurally format the sparsity (for efficient parallel processing)
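Both goals point at the same storage scheme used by SSCN-style libraries: keep only the active sites, as a coordinate list plus a feature matrix, with a hash map for neighbor lookup. A minimal sketch of that layout (the sizes and values are invented):

```python
import numpy as np

# COO-style sparse tensor: only active sites are stored.
# `coords` holds the d-dimensional integer sites, `feats` their feature vectors.
coords = np.array([[0, 1, 2],
                   [5, 5, 5],
                   [9, 0, 3]], dtype=np.int64)          # (N, d) active sites
feats = np.zeros((3, 4), dtype=np.float32)              # (N, C) features

# A hash map from site -> row index lets a convolution kernel look up
# neighbors in O(1); this is the kind of rulebook structure sparse-conv
# libraries build once per layer.
index = {tuple(c): i for i, c in enumerate(coords)}
print(index[(5, 5, 5)])  # row 1 of `feats`
```

Memory now scales with the number of active sites N rather than with the full grid volume, and the contiguous `(N, C)` feature matrix is a dense, parallel-friendly operand.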

Submanifold Sparse Convolutional Networks

the seminal paper that introduced the base method for currently popular sparse 3D convolution acceleration solutions:

https://arxiv.org/abs/1711.10275

[Submitted on 28 Nov 2017]

3D Semantic Segmentation with Submanifold Sparse Convolutional Networks

Benjamin Graham, Martin Engelcke, Laurens van der Maaten

Convolutional networks are the de-facto standard for analyzing spatio-temporal data such as images, videos, and 3D shapes. Whilst some of this data is naturally dense (e.g., photos), many other data sources are inherently sparse. Examples include 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard "dense" implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce new sparse convolutional operations that are designed to process spatially-sparse data more efficiently, and use them to develop spatially-sparse convolutional networks. We demonstrate the strong performance of the resulting models, called submanifold sparse convolutional networks (SSCNs), on two tasks involving semantic segmentation of 3D point clouds. In particular, our models outperform all prior state-of-the-art on the test set of a recent semantic segmentation competition.

key idea/purpose

One of the downsides of prior sparse implementations of convolutional networks is that they "dilate" the sparse data in every layer by applying "full" convolutions. In this work, we show that it is possible to create convolutional networks that keep the same level of sparsity throughout the network. To this end, we develop a new implementation for performing sparse convolutions (SCs) and introduce a novel convolution operator termed submanifold sparse convolution (SSC). We use these operators as the basis for submanifold sparse convolutional networks (SSCNs) that are optimized for efficient semantic segmentation of 3D point clouds.
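The SC/SSC distinction can be illustrated on the active sets alone, ignoring feature values. The two helper functions below are my own sketch of the activity rules, not the paper's implementation: SC activates every output site whose receptive field touches an active input, while SSC computes outputs only at sites that are already active.

```python
import itertools

def sc_active(active, d=2, k=3):
    """Regular sparse convolution (SC): an output site is active if ANY
    input site in its k^d receptive field is active -> the set dilates."""
    r = k // 2
    offsets = list(itertools.product(range(-r, r + 1), repeat=d))
    return {tuple(a + o for a, o in zip(site, off))
            for site in active for off in offsets}

def ssc_active(active, d=2, k=3):
    """Submanifold sparse convolution (SSC): outputs are computed only at
    already-active sites -> the sparsity pattern is preserved exactly."""
    return set(active)

active = {(0, 0)}               # a single active site
print(len(sc_active(active)))   # 9: dilated to the full 3x3 neighborhood
print(len(ssc_active(active)))  # 1: unchanged
```

Stacking SC layers compounds the dilation (9, then 25, then 49 sites in 2D), whereas an SSCN can be arbitrarily deep without losing sparsity.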

Definitions and Spatial Sparsity

We define a d-dimensional convolutional network as a network that takes as input a (d + 1)-dimensional tensor: the input tensor contains d spatio-temporal dimensions (such as length, width, height, time, etc.) and one additional feature-space dimension (e.g., RGB color channels or surface normal vectors).

The input corresponds to a d-dimensional grid of sites, each of which is associated with a feature vector.

We define a site in the input to be active if any element in the feature vector is not in its ground state, e.g., if it is non-zero.

In many problems, thresholding may be used to eliminate input sites at which the feature vector is within a small distance from the ground state.

Note that even though the input tensor is (d + 1)-dimensional, activity is a d-dimensional phenomenon: entire lines along the feature dimension are either active or inactive ==> e.g. a point either exists in the point cloud or it does not, and so, naturally, does its feature vector.
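A tiny NumPy sketch of that point, with invented grid size and values: the (d + 1)-dimensional tensor has d = 2 spatial dimensions plus a channel dimension, but the activity mask is only d-dimensional.

```python
import numpy as np

# (d + 1)-dimensional input: d = 2 spatial dims, C = 3 feature channels.
x = np.zeros((4, 4, 3), dtype=np.float32)
x[1, 2] = [0.3, -0.7, 0.1]   # one active site; its whole feature vector is set

# Activity is d-dimensional: a site is active iff ANY channel deviates
# from the ground state (here, zero), so the mask has shape (4, 4).
active_mask = np.any(x != 0.0, axis=-1)
print(active_mask.shape, active_mask.sum())  # (4, 4) 1
```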

Similarly, the hidden layers of a d-dimensional convolutional network are represented by d-dimensional grids of feature-space vectors. A site in a hidden layer is active if any of the sites it takes as input is active.

The value of the ground state only needs to be calculated once per forward pass at training time, and only once for all forward passes at test time. This allows for substantial savings in computational and memory requirements; the exact savings depend on data sparsity and network depth.

"we argue that the framework described above is unduly restrictive"

Submanifold Dilation

If the input data contains a single active site, then after applying a 3^d convolution there will be 3^d active sites; a second convolution of the same size yields 5^d active sites, and so on. Each "full" convolution dilates the active set, so the sparsity that made the data cheap to process is quickly lost with depth; this is the restriction SSC is designed to remove.
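This dilation is easy to quantify: starting from a single active site, n stacked 3^d "full" convolutions activate every site within Chebyshev distance n, a cube of side (2n + 1). A quick arithmetic sketch:

```python
def active_after(n_layers: int, d: int) -> int:
    """Active sites after n stacked 3^d convolutions, starting from one
    active site: a (2n + 1)-sided cube in d dimensions."""
    return (2 * n_layers + 1) ** d

for n in range(4):
    print(n, active_after(n, d=3))  # 1, 27, 125, 343 -> cubic blow-up
```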

### Efficient 3D Architecture Search with Sparse Point-Voxel Convolution

#### Method overview

Sparse Point-Voxel Convolution (SPVConv) is an efficient 3D module designed to address two problems: conventional sparse convolution cannot maintain high-resolution representations, and point-voxel convolution scales poorly to large 3D scenes. Combining SPVConv with automated Neural Architecture Search (NAS) yields 3D models that are both lightweight and high-performing.

#### Automated neural architecture search

3D-NAS is a model-search framework tailored to 3D scene-understanding tasks. It uses an evolutionary algorithm to discover the best network structure:

- **Initialize the population**: start from a randomly generated set of candidate networks.
- **Evaluate and select**: in each iteration, evaluate every candidate's performance and keep the top \( k \) models as parents.
- **Breed the next generation**:
  - mutation picks a sample from the top candidates and perturbs some of its structural parameters (channel counts, network depth, etc.) with a preset probability;
  - crossover picks two blueprints from the top \( k \) and randomly recombines their features into a new design.
- **Enforce resource constraints**: every offspring must satisfy the preset budget; violators are re-sampled until they comply.

After several rounds of evolution, the best candidate of the final generation is selected as the output.

#### Experimental setup

For practical development and testing, some preparation is unavoidable, including installing the required dependencies and provisioning suitable hardware. The following script (from the original post; the import path and checkpoint filename are placeholders for the SPVNAS codebase, and input preparation is omitted) loads pretrained weights and runs a basic prediction:

```python
import torch
from spvnas.models.spvcnn import SPVCNN

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SPVCNN(num_classes=20).to(device)

checkpoint = torch.load('path_to_checkpoint.pth', map_location=device)
model.load_state_dict(checkpoint['state_dict'])
model.eval()

# Example input data preparation omitted here.
with torch.no_grad():
    outputs = model(inputs.to(device))
```

Several public datasets, such as SemanticKITTI, are also available for further training or fine-tuning of existing models to improve performance in specific application scenarios.