Compact 3D Gaussian Splatting For Dense Visual SLAM
Abstract
Recent work has shown that 3D Gaussian-based SLAM enables high-quality reconstruction, accurate pose estimation, and real-time rendering of scenes. However, these approaches rely on a very large number of redundant 3D Gaussian ellipsoids, which leads to high memory and storage costs and slow training. To address this limitation, we propose a compact 3D Gaussian Splatting SLAM system that reduces both the number of Gaussian ellipsoids and their parameter size. We first propose a sliding window-based masking strategy to remove redundant ellipsoids. We then observe that the covariance matrices (geometry) of most 3D Gaussian ellipsoids are extremely similar, which motivates a novel geometry codebook that compresses the geometric attributes of the 3D Gaussians, i.e., the parameters that define their covariance. Robust and accurate pose estimation is achieved by a global bundle adjustment method with a reprojection loss. Extensive experiments demonstrate that our method achieves faster training and rendering while maintaining state-of-the-art (SOTA) scene representation quality.
Figure 1: Our framework minimizes storage and accelerates rendering while maintaining SOTA image reconstruction performance. The proposed framework eliminates unnecessary 3D Gaussian ellipsoids without affecting performance. We highlight and enlarge some areas to show the significant reduction in 3D Gaussian points.
1 Introduction
Simultaneous localization and mapping (SLAM) is a fundamental computer vision problem with wide applications such as autonomous driving, robotics, and virtual/augmented reality [7, 28]. Several traditional methods, including ORB-SLAM [24, 25], VINS [27], and others [6, 37, 38], have been introduced over the years; they represent scenes with sparse point cloud maps. However, because the point cloud is sparse, such maps prove ineffective for navigation or other purposes. Attention has therefore turned to dense scene reconstruction, exemplified by DTAM [26], Kintinuous [35], and ElasticFusion [36]. However, their accuracy remains unsatisfactory due to high memory costs, slow processing speeds, and other limitations on real-time operation.
Since the introduction of Neural Radiance Fields (NeRF) [22], many follow-up works have appeared in different areas [4], and a number of them combine implicit scene representations with SLAM systems. iMAP [32] is the first method to use a single MLP to represent the scene. NICE-SLAM [45], ESLAM [11], Co-SLAM [34], and PLGSLAM [5] further improve the scene representation with hybrid feature grids, axis-aligned feature planes, joint coordinate-parametric encoding, and progressive scene representation, respectively. To further improve rendering accuracy, recent methods have started to integrate 3D Gaussian Splatting (GS) [13] with SLAM, such as SplaTAM [12], GS-SLAM [39], and others [42, 21]. GS-based SLAM methods use a point-based representation associated with 3D Gaussian attributes and adopt a rasterization pipeline to render images, achieving fast rendering and promising image quality. However, the original GS-based scene representation requires a substantial number of 3D Gaussian ellipsoids to maintain high-fidelity reconstruction, leading to high memory usage and storage requirements: GS-based SLAM systems usually need more than 500MB to represent a small room-sized scene. Moreover, GS-based SLAM systems run significantly slower than NeRF-based methods, which hinders practical deployment, especially on resource-constrained devices.
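To give a rough sense of scale for the 500MB figure, the following back-of-the-envelope sketch counts the bytes per Gaussian under the parameterization of the original 3D Gaussian Splatting [13] (position, scale, rotation quaternion, opacity, and degree-3 spherical harmonics color, all stored as 32-bit floats). This is an illustrative assumption: individual GS-based SLAM systems may store fewer attributes per Gaussian, and the function name is hypothetical.

```python
def bytes_per_gaussian(sh_degree=3, bytes_per_float=4):
    """Storage for one Gaussian, assuming the original 3DGS attribute layout."""
    position = 3   # x, y, z
    scale = 3      # per-axis scale of the ellipsoid
    rotation = 4   # unit quaternion
    opacity = 1
    # Spherical harmonics color: (degree + 1)^2 coefficients per RGB channel.
    sh_coeffs = 3 * (sh_degree + 1) ** 2
    total_floats = position + scale + rotation + opacity + sh_coeffs
    return total_floats * bytes_per_float


per_gaussian = bytes_per_gaussian()
# Number of Gaussians that fit in 500 MB (taken here as 500 * 10^6 bytes).
num_gaussians = 500 * 10**6 // per_gaussian
print(per_gaussian, num_gaussians)
```

Under these assumptions each Gaussian occupies 236 bytes, so 500MB corresponds to roughly two million Gaussians, which is why reducing both the number of Gaussians and the size of each one matters.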
To this end, we propose a compact 3D Gaussian scene representation method to address the high memory demand and slow training speed of GS-based SLAM systems. Our method notably improves storage efficiency while delivering high-quality reconstruction, fast training, and real-time rendering. First, we design a novel sliding window-based online masking method to remove the millions of redundant and unnecessary 3D Gaussian ellipsoids created while the SLAM system is running. With the proposed masking method, a compact 3D Gaussian scene representation is learned, achieving faster rendering and more efficient memory usage, since the computational complexity scales linearly with the number of 3D Gaussian points.
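The idea of sliding window-based masking can be illustrated with a minimal sketch. The details here are assumptions made for illustration, not the formulation used in this paper: each Gaussian is given a hypothetical learnable mask logit, a Gaussian counts as "in the window" if it is visible in any of the most recent keyframes, and only in-window Gaussians whose sigmoid-activated mask falls at or below a threshold are pruned. The function and parameter names (`prune_in_window`, `mask_logits`, `visible_per_frame`, `threshold`) are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def prune_in_window(mask_logits, visible_per_frame, window, threshold=0.5):
    """Return a boolean keep-array over all Gaussians.

    mask_logits: (N,) hypothetical learnable mask parameter per Gaussian.
    visible_per_frame: list of (N,) bool arrays, one per keyframe, oldest first.
    window: number of most recent keyframes forming the sliding window.
    """
    recent = visible_per_frame[-window:]
    # A Gaussian is a pruning candidate only if some recent keyframe sees it.
    in_window = np.any(np.stack(recent, axis=0), axis=0)
    # Binarize the soft mask; low values mark a Gaussian as redundant.
    masked_out = sigmoid(mask_logits) <= threshold
    return ~(in_window & masked_out)


logits = np.array([2.0, -3.0, -1.0, 0.5])
visibility = [
    np.array([True, True, False, False]),
    np.array([True, False, False, True]),
    np.array([False, True, False, True]),
]
keep = prune_in_window(logits, visibility, window=2)
print(keep)
```

In this toy example the third Gaussian has a low mask value but is not visible in the two most recent keyframes, so it is left untouched; restricting pruning to the window keeps the per-step cost bounded as the map grows.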