GGRt: Towards Pose-free Generalizable 3D Gaussian Splatting in Real-time
Yuanyuan Gao1*, Chenming Wu2*, Dingwen Zhang1†, Yalun Dai3, Chen Zhao2, Haocheng Feng2, Errui Ding2, Jingdong Wang2, Junwei Han1
*Equal contribution. Work done during an internship at Baidu VIS.
†Corresponding author.
Abstract
This paper presents GGRt, a novel approach to generalizable novel view synthesis that alleviates the need for real camera poses, the complexity of processing high-resolution images, and lengthy per-scene optimization, thus broadening the applicability of 3D Gaussian Splatting (3D-GS) to real-world scenarios. Specifically, we design a joint learning framework that consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model. With the joint learning mechanism, the proposed framework inherently estimates robust relative pose information from the image observations and thus largely alleviates the requirement for real camera poses. Moreover, we implement a deferred back-propagation mechanism that enables high-resolution training and inference, overcoming the resolution constraints of previous methods. To enhance speed and efficiency, we further introduce a progressive Gaussian cache module that dynamically adjusts during training and inference. As the first pose-free generalizable 3D-GS framework, GGRt achieves inference at ≥ 5 FPS and real-time rendering at ≥ 100 FPS. Through extensive experimentation, we demonstrate that our method outperforms existing NeRF-based pose-free techniques in both inference speed and effectiveness, and approaches the performance of 3D-GS methods that rely on real camera poses. Our contributions provide a significant leap forward for the integration of computer vision and computer graphics into practical applications, offering state-of-the-art results on the LLFF, KITTI, and Waymo Open datasets and enabling real-time rendering for immersive experiences. Project page: https://3d-aigc.github.io/GGRt.
Keywords: Pose-Free, Generalizable 3D-GS, Real-time Rendering
1 Introduction
The recently introduced Neural Radiance Fields (NeRF) [18] and 3D Gaussian Splatting (3D-GS) [11] bridge the gap between computer vision and computer graphics in image-based novel view synthesis and 3D reconstruction. Together with a variety of follow-up variants, they are rapidly revolutionizing many areas, such as virtual reality, film production, and immersive entertainment. To enhance generalization across previously unseen scenes, recent developments have introduced generalizable NeRF [27] and 3D-GS [2] approaches.
Figure 1: Our proposed GGRt is the first pose-free generalizable 3D Gaussian splatting approach, capable of inference at over 5 FPS and real-time rendering.
Despite their ability to reconstruct new scenes without per-scene optimization, previous works usually rely on an accurate camera pose for each image observation, which cannot always be obtained in real-world scenarios. Moreover, these methods deliver unsatisfactory view synthesis quality and struggle to reconstruct at higher resolutions due to their large number of parameters. Finally, synthesizing each novel view demands a complete forward pass through the whole network, making real-time rendering intractable.
To tackle these challenges, this paper proposes GGRt, which brings the benefits of a primitive-based 3D representation, namely fast and memory-efficient rendering, to generalizable novel view synthesis under the pose-free condition. Specifically, we introduce a novel pipeline that jointly learns the IPO-Net and the G-3DG model. Such a pipeline estimates relative camera pose information robustly and thus effectively alleviates the requirement for real camera poses. Subsequently, we develop a deferred back-propagation (DBP) mechanism that allows our method to efficiently perform high-resolution training and inference, a capability beyond the low-resolution limits of existing methods [12, 20, 9, 26]. Furthermore, we design a Gaussians cache module that reuses the relative pose information and image features of the reference views across consecutive training and inference iterations; the cache thus grows and shrinks progressively throughout training and inference, further accelerating both.
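The deferred back-propagation idea can be pictured as a two-stage render: a full-resolution forward pass without gradients to obtain the image-space loss gradient, followed by gradient-carrying re-renders of small patches. The sketch below is only illustrative and not the paper's actual implementation; the callables render_full and render_patch are hypothetical wrappers around a differentiable Gaussian rasterizer.

```python
import torch
import torch.nn.functional as F

def deferred_backprop_step(render_full, render_patch, target, patch=128):
    """One deferred back-propagation step (sketch, hypothetical renderers).

    render_full():                rasterizes the whole image, shape (H, W, 3)
    render_patch(y0, x0, y1, x1): re-rasterizes one crop with autograd enabled
    """
    # Stage 1: render the full high-resolution image WITHOUT a compute graph,
    # then obtain the per-pixel gradient dL/dI of the photometric loss.
    with torch.no_grad():
        full = render_full()
    full = full.requires_grad_(True)          # leaf tensor, receives dL/dI
    loss = F.mse_loss(full, target)           # any photometric loss works here
    loss.backward()
    pixel_grad = full.grad                    # cached image-space gradient

    # Stage 2: re-render patch by patch with autograd enabled and inject the
    # cached gradient, so GPU memory stays bounded by a single patch.
    H, W, _ = target.shape
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            crop = render_patch(y, x, min(y + patch, H), min(x + patch, W))
            crop.backward(gradient=pixel_grad[y:y + patch, x:x + patch])
    return loss.item()
```

Because only one patch's activations are kept alive at a time, the peak memory of the backward pass no longer scales with the full image resolution, which is what makes high-resolution training feasible in this setting.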
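The progressive Gaussians cache can likewise be thought of as a keyed store of per-reference-view predictions that grows as new reference views arrive and shrinks when views leave the active window. The following is a minimal, hypothetical sketch under our own naming (GaussianCache, compute_fn); it illustrates the reuse idea rather than the paper's exact module.

```python
from collections import OrderedDict

class GaussianCache:
    """Sketch of a progressive per-reference-view cache (hypothetical API)."""

    def __init__(self, max_views=16):
        self.max_views = max_views
        self.store = OrderedDict()  # view_id -> (image features, Gaussians)

    def get_or_compute(self, view_id, compute_fn):
        # Reuse the entry if this reference view was already encoded in a
        # previous iteration; otherwise encode it once and cache the result.
        if view_id in self.store:
            self.store.move_to_end(view_id)       # mark as recently used
            return self.store[view_id]
        entry = compute_fn(view_id)               # encode image, predict Gaussians
        self.store[view_id] = entry
        if len(self.store) > self.max_views:      # grow up to a budget, then
            self.store.popitem(last=False)        # evict the oldest view
        return entry

    def prune(self, active_view_ids):
        # Diminish the cache: drop views no longer used as references.
        for vid in list(self.store):
            if vid not in active_view_ids:
                del self.store[vid]
```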
To the best of our knowledge, our work is the first pose-free generalizable 3D Gaussian splatting framework, performing inference at ≥ 5 FPS and rendering in real time at ≥ 100 FPS. Extensive experiments demonstrate that our method surpasses existing NeRF-based pose-free approaches in inference speed and effectiveness. Compared to pose-based 3D-GS methods, our approach provides faster inference and competitive performance, even without the camera pose prior.
2 Related Work
2.1 Generalizable Novel View Synthesis
Pioneering approaches to novel view synthesis leverage image-based rendering techniques, such as light field rendering [23, 21] and view interpolation [27, 32]. The introduction of NeRF [18] marks a significant milestone: it uses neural networks to model the volumetric scene function and demonstrates impressive results on this task, but requires per-scene optimization and accurate camera poses. To address the problem of generalization, researchers have explored several directions. For instance, PixelNeRF [34] presents a NeRF architecture that is conditioned on image inputs in a fully convolutional fashion. NeuRay [15] enhances the NeRF framework by predicting the visibility of 3D points relative to input views, allowing the radiance field construction to concentrate on visible image features. Furthermore, GNT [27] integrates multi-view geometry into an attention-based representation, which is then decoded through an attention mechanism in the view transformer for rendering novel views.
The recent LRM [10] and its multi-view version [13] also adopt transformers for generalizable scene reconstruction using either a single image or four posed images. However, these works only demonstrate the capability on object-centric scenes, whereas our work targets the more ambitious goal of generalizing to both indoor and outdoor scenes. Fu et al. [4] propose a generalizable neural field built from posed RGB images and depth maps, eschewing a fusion module. Our work, in contrast, requires only camera input without pose information.
The aforementioned works use implicit representations inherited from NeRF and its variants, resulting in slow training and inference. In contrast, pixelSplat [2] is the first generalizable 3D-GS work that tackles the problem of synthesizing novel views between a pair of images. However, it still requires accurate poses and only supports a pair of images as input. Our work, instead, removes the need for image poses and supports large-scale scene inference with an unlimited number of reference views.
2.2 Pose-free Modeling for Novel View Synthesis
The first attempt towards pose-free novel view synthesis is iNeRF [33], which uses key-point matching to predict camera poses. NeRF– [31] proposes to jointly optimize camera pose embeddings and NeRF. [14] proposes to learn neural 3D representations and register camera frames using coarse-to-fine positional encodings. [1] integrates scale- and shift-corrected monocular depth priors to train its model, enabling the joint recovery of relative poses between successive frames and novel view synthesis of the scenes. [16] employs a strategy that combines pre-trained depth and optical-flow priors to progressively refine blockwise NeRFs, recovering camera poses frame by frame.
The implicit modeling inherent to NeRF complicates the simultaneous optimization of scene and camera poses. In contrast, the recent 3D-GS provides an explicit point-based scene representation, enabling real-time rendering and highly efficient optimization. A recent work [5] pushes the boundary of simultaneous scene and pose optimization. However, such approaches still require substantial per-scene training and optimization.
In generalizable settings, SRT [19], VideoAE [12], RUST [20], MonoNeRF [26], DBARF [3], and FlowCam [22] learn a generalizable scene representation from unposed videos using NeRF's implicit representation. These works show unsatisfactory view synthesis performance without per-scene optimization and inherit the drawbacks of NeRF, such as lacking the real-time rendering afforded by explicit primitives. PF-LRM [28] extends LRM to pose-free scenes by using a differentiable PnP solver, but it shares the limitations of LRM [10] mentioned above. To the best of our knowledge, our work is the first pose-free generalizable 3D-GS framework that enables efficient inference and real-time rendering, exhibiting SOTA performance in various metrics compared to previous approaches.