Gamba: Combining Gaussian Splatting with Mamba for Single-View 3D Reconstruction

Latest recommended article published on 2025-06-21 14:11:49

c2a2o2

License: CC 4.0 BY-SA

Tags: 3d

Article link: https://blog.youkuaiyun.com/c2a2o2/article/details/138025532

Gamba: Marry Gaussian Splatting with Mamba for Single-View 3D Reconstruction

Qiuhong Shen (1), Xuanyu Yi (3), Zike Wu (3), Pan Zhou (2, 4), Hanwang Zhang (3, 5), Shuicheng Yan (5), Xinchao Wang (1)
(1) National University of Singapore, (2) Singapore Management University, (3) Nanyang Technological University, (4) Sea AI Lab, (5) Skywork AI

Abstract

We tackle the challenge of efficiently reconstructing a 3D asset from a single image, motivated by the growing demand for automated 3D content creation pipelines. Previous methods primarily rely on Score Distillation Sampling (SDS) and Neural Radiance Fields (NeRF). Despite their significant success, these approaches encounter practical limitations due to lengthy optimization and considerable memory usage. In this report, we introduce Gamba, an end-to-end amortized 3D reconstruction model from single-view images, emphasizing two main insights: (1) 3D representation: leveraging a large number of 3D Gaussians for an efficient 3D Gaussian splatting process; (2) Backbone design: introducing a Mamba-based sequential network that facilitates context-dependent reasoning and linear scalability with the sequence (token) length, accommodating a substantial number of Gaussians. Gamba incorporates significant advancements in data preprocessing, regularization design, and training methodologies. We assess Gamba against existing optimization-based and feed-forward 3D generation approaches using the real-world scanned OmniObject3D dataset. Here, Gamba demonstrates competitive generation capabilities, both qualitatively and quantitatively, while achieving remarkable speed, approximately 0.6 seconds on a single NVIDIA A100 GPU.

Work in progress, partially done at Sea AI Lab and 2050 Research, Skywork AI.

1 Introduction

We tackle the challenge of efficiently extracting a 3D asset from a single image, a task with substantial implications across diverse industrial sectors. It enables AR/VR content generation from a single snapshot and aids the development of autonomous vehicle path planning through monocular perception (Sun et al., 2023; Gul et al., 2019; Yi et al., 2023).

Previous approaches to single-view 3D reconstruction have mainly relied on Score Distillation Sampling (SDS) (Poole et al., 2022), which leverages pre-trained 2D diffusion models (Graikos et al., 2022; Rombach et al., 2022) to guide optimization of the underlying representations of 3D assets. These optimization-based approaches have achieved remarkable success and are known for their high fidelity and generalizability. However, they require a time-consuming per-instance optimization process (Tang, 2022; Wang et al., 2023d; Wu et al., 2024) to generate a single object, and they suffer from artifacts such as the “multi-face” problem, which arises from bias in pre-trained 2D diffusion models (Hong et al., 2023a). In addition, previous approaches predominantly use neural radiance fields (NeRF) (Mildenhall et al., 2021; Barron et al., 2021), which rely on high-dimensional multi-layer perceptrons (MLPs) and inefficient volume rendering (Mildenhall et al., 2021). This computational cost significantly limits practical applications under limited compute budgets. For instance, the Large Reconstruction Model (LRM) (Hong et al., 2023b) is confined to a resolution of 32 using a triplane-NeRF (Shue et al., 2023) representation, and the resolution of its renderings is limited to 128 due to the bottleneck of online volume rendering.
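To make the volume-rendering bottleneck concrete, here is a minimal sketch of the standard NeRF quadrature for a single ray, written in plain Python. It is an illustration only, not code from Gamba or LRM: the function names `render_ray` and `mlp_queries` are hypothetical, and the figure of 64 samples per ray used in the note below is an assumed value that does not come from the text.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray using the standard NeRF quadrature.

    densities: non-negative density (sigma) values, one per sample
    colors:    (r, g, b) tuples, one per sample
    deltas:    distances between adjacent samples along the ray
    Returns the accumulated pixel color and the remaining transmittance.
    """
    transmittance = 1.0  # fraction of light that still reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha  # light left after passing this segment
    return tuple(pixel), transmittance


def mlp_queries(height, width, samples_per_ray):
    """Number of radiance-field evaluations needed to render one image."""
    return height * width * samples_per_ray
```

Every sample on every ray needs its own density and color query, so the cost of one image grows with height times width times samples per ray. Under the assumed 64 samples per ray, `mlp_queries(128, 128, 64)` already exceeds one million field evaluations for a single 128 by 128 rendering, which is the kind of cost that makes online volume rendering a bottleneck during training.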

Figure 1: (a) We propose Gamba, an end-to-end, feed-forward single-view reconstruction pipeline, which marries 3D Gaussian Splatting with Mamba to achieve fast reconstruction. (b) The relationship between the 3DGS generation process and the Mamba sequential prediction pattern.

To address these challenges and achieve efficient single-view 3D reconstruction, we seek an amortized generative framework built on 3D Gaussian Splatting (3DGS), which is notable for its memory-efficient and high-fidelity tiled rendering (Kerbl et al., 2023; Zwicker et al., 2002; Chen & Wang, 2024; Wang et al., 2024). Despite recent exciting progress (Tang et al., 2023), how to properly and immediately generate 3D Gaussians remains a less studied topic. Recent prevalent 3D amortized generative models (Hong et al., 2023b; Wang et al., 2023b; Xu et al., 2024; 2023; Zou et al., 2023; Li et al., 2023) predominantly use transformer-based architectures as their backbones (Vaswani et al., 2017; Peebles & Xie, 2023), but we argue that these widely used architectures are sub-optimal for generating 3DGS. The crucial challenge stems from the fact that 3DGS requires a sufficient number of 3D Gaussians to accurately represent a 3D model or scene. However, the time and memory complexity of Transformers increases quadratically with the number of tokens (Vaswani et al., 2017), which limits the expressiveness of 3DGS because too few tokens are available for 3D Gaussians. Furthermore, the 3DGS parameters possess specific physical meanings, making the simultaneous generation of 3DGS parameters a more challenging task.
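The scaling argument can be illustrated with a rough sketch that counts work per sequence, assuming one token per predicted Gaussian. It is not Gamba's architecture: `attention_pair_count`, `scan_step_count`, and `linear_scan` are hypothetical names, constant factors such as the hidden dimension are ignored, and the recurrence uses a fixed scalar decay rather than the input-dependent (selective) parameters that Mamba uses.

```python
def attention_pair_count(num_tokens):
    """Self-attention scores every token against every token: N * N pairs."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens):
    """A sequential state-space scan visits each token once: N steps."""
    return num_tokens


def linear_scan(inputs, decay=0.9, gain=0.1):
    """Toy linear recurrence h_t = decay * h_(t-1) + gain * x_t.

    One pass over the sequence with a fixed-size state, so both time and
    memory grow linearly with the sequence length. The decay and gain values
    are arbitrary placeholders chosen for illustration.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = decay * state + gain * x
        outputs.append(state)
    return outputs
```

Going from 1,024 to 16,384 tokens multiplies `scan_step_count` by 16 but multiplies `attention_pair_count` by 256. This gap is why a backbone with linear scaling can afford many more tokens, and therefore more Gaussians, under the same compute budget.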