Depth-Regularized Optimization for 3D Gaussian Splatting in Few-Shot Images
Jaeyoung Chung1  Jeongtaek Oh2  Kyoung Mu Lee1,2
1Department of ECE, ASRI, Seoul National University, Seoul, Korea
2IPAI, ASRI, Seoul National University, Seoul, Korea
{robot0321, ohjtgood, kyoungmu}@snu.ac.kr
Abstract
In this paper, we present a method to optimize Gaussian splatting with a limited number of images while avoiding overfitting. Representing a 3D scene by combining numerous Gaussian splats has yielded outstanding visual quality. However, it tends to overfit the training views when only a small number of images are available. To address this issue, we introduce a dense depth map as a geometry guide to mitigate overfitting. We obtain the depth map from a pre-trained monocular depth estimation model and align its scale and offset using sparse COLMAP feature points. The adjusted depth aids the color-based optimization of 3D Gaussian splatting, mitigating floating artifacts and ensuring adherence to geometric constraints. We verify the proposed method on the NeRF-LLFF dataset with a varying number of few-shot images. Our approach demonstrates robust geometry compared to the original method, which relies solely on images.
1 Introduction
Figure 1: The efficacy of depth regularization in a few-shot setting. We optimize Gaussian splats with a limited number of images, avoiding overfitting through geometry guidance estimated from the images. Please note that we utilized only two images to create this 3D scene.
Reconstruction of three-dimensional space from images has long been a challenge in the computer vision field. Recent advancements show the feasibility of photorealistic novel view synthesis [3, 31], igniting research into reconstructing a complete 3D space from images. Driven by progress in computer graphics techniques and industry demand, particularly in sectors such as virtual reality [14] and mobile [11], research on achieving high-quality, high-speed real-time rendering has been ongoing. Among the recent notable developments, 3D Gaussian Splatting (3DGS) [23] stands out through its combination of high quality, rapid reconstruction speed, and support for real-time rendering. 3DGS employs Gaussian-attenuated spherical harmonic splats [38, 12] with opacity as primitives to represent every part of a scene. It guides the splats toward a consistent geometry by constraining them to satisfy multiple images simultaneously.
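As background for the discussion that follows, the point-based alpha blending that 3DGS uses to render a pixel can be written in its standard form (notation here is the conventional one, not taken verbatim from this paper):

$$C = \sum_{i=1}^{N} c_i \,\alpha_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right),$$

where the $N$ splats covering the pixel are sorted front to back, $c_i$ is the view-dependent color evaluated from the spherical harmonics, and $\alpha_i$ is the learned opacity attenuated by the splat's projected 2D Gaussian falloff. Because each splat's parameters are optimized independently against this color loss, nothing in the formulation itself enforces a globally consistent geometry.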
The approach of aggregating small splats for a scene provides the capability to express intricate details, yet it is prone to overfitting due to its local nature. 3DGS [24] optimizes independent splats according to multi-view color supervision without a global structure. Therefore, in the absence of a sufficient quantity of images that can offer a global geometric cue, there exists no safeguard against overfitting. This issue becomes more pronounced as the number of images used for optimizing a 3D scene decreases. The limited geometric information from a small number of images leads to incorrect convergence toward a local optimum, resulting in optimization failure or floating artifacts, as shown in Figure 1. Nevertheless, the capability to reconstruct a 3D scene with a restricted number of images is crucial for practical applications, prompting us to tackle the few-shot optimization problem.
One intuitive solution is to supplement an additional geometric cue such as depth. In numerous 3D reconstruction contexts [6], depth proves immensely valuable for reconstructing 3D scenes by providing direct geometric information. To obtain such robust geometric cues, depth sensors aligned with RGB cameras are employed. Although these devices offer dense depth maps with minimal error, the necessity for such equipment also presents obstacles to practical applications.
Hence, we attain a dense depth map by adjusting the output of a depth estimation network with a sparse depth map from the renowned Structure-from-Motion (SfM) pipeline, which computes camera parameters and 3D feature points simultaneously. 3DGS also uses SfM, particularly COLMAP [41], to acquire such information. However, SfM also suffers a notable scarcity of available 3D feature points when the number of images is small. The sparse nature of the point cloud likewise makes it impractical to regularize all Gaussian splats with it. Hence, a method for inferring dense depth maps is essential. One way to extract dense depth from images is to utilize monocular depth estimation models. While these models are able to infer dense depth maps from individual images based on priors learned from data, they produce only relative depth due to scale ambiguity. Since this scale ambiguity leads to critical geometry conflicts across multi-view images, we need to adjust the scales to prevent conflicts between independently inferred depths. We show that this can be done by fitting the sparse depth, a free by-product of COLMAP [41], to the estimated dense depth map.
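The scale-and-offset alignment described above amounts to a two-parameter least-squares fit: find a scale s and offset t such that s·d_pred + t best matches the COLMAP depths at the pixels where sparse points project. A minimal sketch of this step is shown below; the function name, array layout, and mask convention are our assumptions, not the paper's actual interface.

```python
import numpy as np

def align_depth(pred_depth, sparse_depth, mask):
    """Fit scale s and offset t so that s * pred_depth + t matches the
    sparse COLMAP depths in the least-squares sense, then return the
    aligned dense depth map.

    pred_depth:   (H, W) relative depth from a monocular estimator
    sparse_depth: (H, W) depth at projected COLMAP points, arbitrary elsewhere
    mask:         (H, W) boolean, True where a COLMAP point projects
    """
    x = pred_depth[mask]             # relative depths at sparse locations
    y = sparse_depth[mask]           # corresponding COLMAP depths
    # Solve min_{s,t} || s*x + t - y ||^2 via the normal equations
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred_depth + t        # dense depth in the COLMAP scale
```

Because each image's depth is aligned to the same COLMAP reconstruction, the per-image ambiguity is resolved in a mutually consistent way across views.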
In this paper, we propose a method to represent 3D scenes using a small number of RGB images, leveraging prior information from a pre-trained monocular depth estimation model [5] and a smoothness constraint. We adapt the scale and offset of the estimated depth to the sparse COLMAP points, resolving the scale ambiguity. We use the adjusted depth as a geometry guide that assists the color-based optimization, reducing floating artifacts and satisfying geometric conditions. We observe that even the adjusted depth, despite its roughness, helps guide the scene toward a geometrically optimal solution. We prevent the overfitting problem by incorporating an early stop strategy, where the optimization process stops when the depth-guide loss starts to rise. Moreover, to achieve more stability, we apply a smoothness constraint, ensuring that neighboring 3D points have similar depths. We adopt 3DGS as our baseline and evaluate the performance of our method on the NeRF-LLFF [30] dataset. We confirm that our strategy leads to plausible results not only in RGB novel-view synthesis but also in 3D geometry reconstruction. Through further experiments, we demonstrate the influence of geometric cues such as depth and initial points on Gaussian splatting; they significantly affect the stability of its optimization.
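The early stop strategy above can be sketched as a small wrapper around the optimization loop: track the depth-guide loss and halt once it has stopped improving for a while, treating a sustained rise as the onset of overfitting. The function name, `step_fn` interface, and patience mechanism below are illustrative assumptions, not the paper's implementation.

```python
def train_with_depth_early_stop(step_fn, max_iters=10000, patience=500):
    """Run optimization steps, halting when the depth-guide loss has not
    improved for `patience` consecutive iterations.

    step_fn: callable taking the iteration index, performing one
             optimization step, and returning the current depth-guide
             loss (hypothetical interface).
    Returns the iteration at which training stopped.
    """
    best, since_best = float("inf"), 0
    for it in range(max_iters):
        depth_loss = step_fn(it)
        if depth_loss < best:
            best, since_best = depth_loss, 0
        else:
            since_best += 1
            if since_best >= patience:  # depth loss has started to rise
                break
    return it
```

A patience window, rather than stopping at the first uptick, avoids halting on ordinary iteration-to-iteration noise in the loss.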
In summary, our contributions are as follows:
∙ We propose a depth-guided Gaussian splatting optimization strategy that enables optimizing a scene with a few images, mitigating the overfitting issue. We demonstrate that even an estimated depth adjusted with a sparse point cloud, an outcome of the SfM pipeline, can play a vital role in geometric regularization.
∙ We present a novel early stop strategy: halting the training process when the depth-guided loss starts to rise. We illustrate the influence of each strategy through thorough ablation studies.
∙ We show that the adoption of a smoothness term for the depth map directs the model toward finding the correct geometry. Comprehensive experiments reveal enhanced performance attributable to the inclusion of the smoothness term.
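The smoothness term in the last contribution penalizes depth differences between neighboring pixels. A minimal sketch of such a term is given below; the edge-aware weighting (down-weighting the penalty across strong image gradients) is a common variant we include as an assumption, and the function name and `gamma` parameter are hypothetical.

```python
import numpy as np

def depth_smoothness_loss(depth, image=None, gamma=10.0):
    """Penalize depth differences between horizontally and vertically
    neighboring pixels. If an RGB image of shape (H, W, 3) is given,
    penalties across strong image edges are down-weighted (edge-aware
    variant, an assumption on our part).
    """
    dx = np.abs(depth[:, 1:] - depth[:, :-1])   # horizontal neighbors
    dy = np.abs(depth[1:, :] - depth[:-1, :])   # vertical neighbors
    if image is not None:
        ix = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
        iy = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
        dx = dx * np.exp(-gamma * ix)           # relax penalty at edges
        dy = dy * np.exp(-gamma * iy)
    return dx.mean() + dy.mean()
```

Adding such a term to the depth-guide loss discourages isolated splats from drifting in depth relative to their neighbors, which is one plausible mechanism for the stability improvement reported here.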
2 Related Work
Novel view synthesis
Structure from Motion (SfM) [46] and Multi-View Stereo (MVS) [45] are techniques for reconstructing 3D structures from multiple images, and they have been studied for a long time in the computer vision field. Among the continuous developments, COLMAP [41] is a widely used representative tool. COLMAP performs camera pose calibration and finds sparse 3D keypoints using the epipolar constraint [22] of multi-view images. For denser and more realistic reconstruction, deep-learning-based 3D reconstruction techniques have been the main subject of study. [21,