View-Consistent 3D Editing with Gaussian Splatting
Abstract
The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing, offering efficient, high-fidelity rendering and enabling precise local manipulations. Currently, diffusion-based 2D editing models are harnessed to modify multi-view rendered images, which then guide the editing of 3DGS models. However, this approach faces a critical issue of multi-view inconsistency, where the guidance images exhibit significant discrepancies across views, leading to mode collapse and visual artifacts of 3DGS. To this end, we introduce View-consistent Editing (VcEdit), a novel framework that seamlessly incorporates 3DGS into image editing processes, ensuring multi-view consistency in edited guidance images and effectively mitigating mode collapse issues. VcEdit employs two innovative consistency modules: the Cross-attention Consistency Module and the Editing Consistency Module, both designed to reduce inconsistencies in edited images. By incorporating these consistency modules into an iterative pattern, VcEdit proficiently resolves the issue of multi-view inconsistency, facilitating high-quality 3DGS editing across a diverse range of scenes.
Keywords: 3D Editing · 3D Gaussian Splatting · Multi-view Consistency
Figure 1:Capability highlight of our method: VcEdit. Given a source 3D Gaussian Splatting and a user-specified text prompt, our VcEdit enables versatile scene and object editing. By ensuring multi-view consistent image guidance, VcEdit alleviates artifacts and excels in high-quality editing.
1 Introduction
We consider the problem of text-driven 3D model editing: given a source 3D model and user-specified text instructions, the task is to modify the 3D model according to the instructions, as depicted in Fig. 1, ensuring both editing fidelity and preservation of essential source content [34, 5, 7, 4, 18]. This problem holds paramount importance across a variety of industrial applications, such as real-time outfit changes for 3D digital humans and immersive AR/VR interactive environments [39, 3, 42]. Recently, groundbreaking 3D Gaussian Splatting (3DGS) [19] has emerged as a promising “silver bullet” for 3D editing, notable for its efficient, high-fidelity rendering and explicit representation (3D anisotropic balls known as Gaussians) suitable for local manipulation.
In the context of editing 3DGS models, thanks to recent progress in large-scale pre-trained 2D diffusion models, existing methods [15, 8, 9, 6, 10] leverage off-the-shelf 2D editing models [12, 14, 28, 38, 11, 1] to guide optimization of the 3DGS model. As shown in Fig. 2(a), this pattern renders source 3DGS into multi-view 2D images, manipulates them via 2D editing models using text prompts, and then employs these adjusted images to fine-tune the original 3DGS. Beyond achieving plausible editing outcomes, this image-based pattern also facilitates user-friendly interaction, enabling users to pre-select their preferred edited images and personalize the editing workflow.
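The render → edit → fine-tune pattern above can be sketched in a toy setting. In this hedged illustration, a fixed linear projection per camera stands in for 3DGS rasterization, a simple pixel transform stands in for the diffusion-based 2D editor, and the parameter vector stands in for the Gaussians' attributes; none of the names or shapes come from the paper's implementation.

```python
import numpy as np

# Toy sketch of image-guided 3DGS editing: render multi-view images,
# edit each view independently, then fine-tune the shared parameters
# on the edited images with a simple L2 photometric loss.
rng = np.random.default_rng(0)
n_gaussians, n_pixels, n_views = 8, 16, 4

# One projection matrix per camera (stand-in for splatting).
proj = [rng.normal(size=(n_pixels, n_gaussians)) for _ in range(n_views)]
params = rng.normal(size=n_gaussians)  # stand-in for Gaussian attributes

def render(view):
    """Render the current 'scene' from one camera view."""
    return proj[view] @ params

def edit_2d(image):
    """Stand-in 2D editor: deterministically transforms a rendering."""
    return 0.5 * image + 1.0

# Steps 1-2: render every view, then edit each rendering independently.
targets = [edit_2d(render(v)) for v in range(n_views)]

# Step 3: fine-tune the 'Gaussians' on the edited images.
loss_before = sum(float(np.sum((render(v) - targets[v]) ** 2))
                  for v in range(n_views))
lr = 0.005
for _ in range(500):
    grad = np.zeros_like(params)
    for v in range(n_views):
        grad += proj[v].T @ (proj[v] @ params - targets[v])
    params -= lr * grad / n_views

loss_after = sum(float(np.sum((render(v) - targets[v]) ** 2))
                 for v in range(n_views))
```

Because each view is edited in isolation here, nothing in this loop enforces agreement across views — which is exactly the inconsistency problem discussed next.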
However, such image-guided 3DGS editing suffers from a notorious multi-view inconsistency that cannot be ignored. Fig. 2(a) vividly illustrates that images edited separately with a state-of-the-art 2D editing model [38] manifest pronounced inconsistencies across views — the views of a man are edited into different styles of clowns. Guided by these significantly varied edited images, the 3DGS model struggles to learn: few training images are mutually coherent, while the majority convey conflicting information. Unfortunately, the explicitness and inherent densification process of 3DGS make it especially vulnerable to multi-view inconsistency, which hampers 3DGS in densifying under-reconstructed regions or pruning over-reconstructed regions [35]. Consequently, training with multi-view inconsistent guidance can lead to mode collapse of the 3DGS, characterized by ambiguity between the source and target as well as the flickering artifacts revealed in Fig. 2(a).
Thus, the crux lies in addressing the multi-view inconsistency of the image guidance. We conjecture that this problem stems from the lack of 3D awareness in 2D editing models; that is, they inherently process each view in isolation. Therefore, we introduce View-consistent Editing (VcEdit), a high-quality image-guided 3DGS editing framework. This framework seamlessly incorporates 3DGS into the image editing process to achieve multi-view consistent guidance, thus effectively addressing the issue of 3DGS mode collapse. As illustrated in Fig. 2(b), VcEdit employs specially designed multi-view consistency modules within an iterative pattern.
We design two effective consistency modules that exploit the explicit representation and fast rendering capability of 3DGS: (1) The Cross-attention Consistency Module (CCM) consolidates the multi-view cross-attention maps in the diffusion-based image editing model, harmonizing the model’s attended 3D region across all views. More concretely, this process inverse-renders the original cross-attention maps from all views onto each Gaussian within the source 3DGS, thereby creating an averaged 3D map. This 3D map is then rendered back to 2D, and the resulting consolidated cross-attention maps replace the originals for more coherent edits. (2) The Editing Consistency Module (ECM) directly calibrates the multi-view inconsistent editing outputs: we fine-tune a source-cloned 3DGS on the editing outputs and then render the 3DGS back to images. Thanks to the rapid rendering speed of 3DGS, this mechanism efficiently suppresses incoherent content in each edited image.
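The CCM step — inverse-render per-view attention to 3D, average, render back — can be sketched as follows. This is a hedged toy illustration: nonnegative `blend` matrices stand in for the real per-view splatting weights, and all names and shapes are assumptions, not the paper's code.

```python
import numpy as np

# Toy sketch of cross-attention consolidation (CCM): inverse-render each
# view's 2D attention map onto the Gaussians, average in 3D, then render
# the averaged 3D map back to every view.
rng = np.random.default_rng(1)
n_gaussians, n_pixels, n_views = 6, 10, 3

# blend[v][p, g]: contribution of Gaussian g to pixel p in view v.
blend = [rng.random((n_pixels, n_gaussians)) for _ in range(n_views)]
attn_2d = [rng.random(n_pixels) for _ in range(n_views)]  # per-view maps

# Inverse-render: weight-average every view's attention onto each Gaussian,
# yielding one shared attention value per Gaussian (the averaged 3D map).
num = sum(blend[v].T @ attn_2d[v] for v in range(n_views))
den = sum(blend[v].T @ np.ones(n_pixels) for v in range(n_views))
attn_3d = num / den

# Render back to 2D: consolidated maps that replace the originals.
consolidated = [(blend[v] @ attn_3d) / (blend[v] @ np.ones(n_gaussians))
                for v in range(n_views)]
```

Each consolidated pixel is a convex combination of entries of the shared 3D map, so after this step every view attends to the same 3D regions rather than to view-specific ones.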
To further mitigate the multi-view inconsistency issue, we extend VcEdit to an iterative pattern: editing rendered images → updating the 3DGS → repeating. When the image editing model yields overly inconsistent initial edits, this pattern allows initially inconsistent views to be corrected in later iterations [15], continuously refining the 3DGS in a reciprocal cycle. Fig. 2(b) illustrates that the 3DGS of a “man” is iteratively guided to a consistent style that aligns with the desired “clown” target. To meet the demand for rapid iteration, VcEdit integrates InfEdit [38], a high-quality, fast image editing model that bypasses the lengthy DDIM-inversion phase. Depending on the complexity of the scene and the user instructions, VcEdit's processing time ranges from 10 to 20 minutes per sample.
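The outer edit → update → repeat loop can be sketched in the same toy linear-renderer setting. This is an illustrative sketch under stated assumptions: the "editor" moves each view halfway toward a shared target but adds view-specific noise, mimicking multi-view inconsistent edits, and refitting the shared parameters each round averages that noise away so the views converge toward a consistent result. The names (`target_3d`, `edit_2d`, etc.) are hypothetical.

```python
import numpy as np

# Toy sketch of the iterative pattern: edit rendered views -> update the
# shared 3D parameters -> repeat. A linear projection stands in for
# rendering; least-squares refitting stands in for 3DGS fine-tuning.
rng = np.random.default_rng(2)
n_gaussians, n_pixels, n_views = 8, 16, 4
proj = [rng.normal(size=(n_pixels, n_gaussians)) for _ in range(n_views)]
params = rng.normal(size=n_gaussians)     # source scene ("man")
target_3d = rng.normal(size=n_gaussians)  # desired edit ("clown")

def edit_2d(image, view):
    # Partial, noisy 2D edit: moves the rendering halfway toward the
    # target view, with view-specific noise mimicking inconsistency.
    goal = proj[view] @ target_3d + rng.normal(scale=0.5, size=n_pixels)
    return 0.5 * image + 0.5 * goal

A = np.vstack(proj)                       # stacked multi-view renderer
for _ in range(10):                       # outer edit -> update rounds
    targets = [edit_2d(proj[v] @ params, v) for v in range(n_views)]
    # 'Fine-tune' the shared parameters on this round's edited views;
    # fitting one 3D model to all views cancels per-view noise.
    params, *_ = np.linalg.lstsq(A, np.concatenate(targets), rcond=None)
```

Each round halves the remaining gap to the target while the per-view noise is averaged out by the joint fit, so the shared parameters — and hence all rendered views — settle near the consistent target.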