Gaussian Grouping: Segment and Edit Anything in 3D Scenes
1. What
What does this paper set out to do? (Summarized in one sentence from the abstract and conclusion.)
Gaussian Grouping is the first 3D Gaussian-based approach to jointly reconstruct and segment anything in open-world 3D scenes.
Each Gaussian carries a compact Identity Encoding, which is supervised by 2D masks from SAM together with a newly introduced 3D spatial consistency regularization; the grouped Gaussians can then also be used for editing.
- Explanation of Open-world
An open-world scenario refers to an uncertain, dynamic and complex environment that contains a variety of objects, scenes and tasks.
Alternatively, “open-world scene understanding” refers to the ability of a model to generalize to scenes or environments that it has not been explicitly trained on. Here, “open-world” implies that the model must adapt to and understand a wide range of scenes, including ones that may differ greatly from those in its training data.
2. Why
Under what conditions or needs was this research proposed, what core problems or deficiencies does it address, what have others done, and what are the innovations? (From the introduction and related work.)
This may cover background, the problem, prior work, and the innovation:
- Existing methods [8, 37] rely on manually-labeled datasets or require accurately scanned 3D point clouds [33, 42] as input.
- Existing NeRF-based methods [14, 17, 25, 39] are computation-hungry and hard to adapt to downstream tasks, because the learned neural networks, such as MLPs, cannot easily decompose each part or module of the 3D scene.
- As for radiance-based open-world scene understanding: unlike Gaussian Grouping, most of these methods are designed for in-domain scene modeling and cannot generalize to open-world scenarios.
3. How
The method is introduced in detail below, following the pipeline in Figure 2.
3.1 Anything Mask Input and Consistency
As shown in Figure 2(a), the inputs are a set of multi-view captures, the 2D segmentations automatically generated by SAM, and the corresponding cameras calibrated via SfM.
As shown in Figure 2(b), a well-trained zero-shot tracker [7] is used to propagate and associate masks across views, so that each 2D mask receives a unique ID in the 3D scene. In Figure 2(b), different colors represent different segmentation labels.
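To make the idea of associating masks across views concrete, here is a minimal sketch. It is not the tracker from [7]: it swaps in a much simpler greedy IoU match between consecutive frames, purely for illustration. The names `iou` and `associate_masks`, the representation of a mask as a set of pixel coordinates, and the 0.5 threshold are all assumptions of this sketch.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # assumption: a mask is the set of (row, col) pixels it covers


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def associate_masks(frames: List[List[Mask]], thresh: float = 0.5) -> List[List[int]]:
    """Assign every mask an ID that stays the same across frames.

    Simplified stand-in for a video tracker: each mask inherits the ID of the
    best-overlapping, not-yet-claimed mask in the previous frame; if no mask
    overlaps by at least `thresh`, it receives a fresh ID.
    """
    next_id = 0
    prev: Dict[int, Mask] = {}  # ID -> mask in the previous frame
    all_ids: List[List[int]] = []
    for masks in frames:
        ids: List[int] = []
        used: Set[int] = set()
        for m in masks:
            best_id, best_score = -1, thresh
            for pid, pm in prev.items():
                if pid in used:
                    continue
                score = iou(m, pm)
                if score >= best_score:
                    best_id, best_score = pid, score
            if best_id < 0:  # no sufficiently overlapping mask: new object
                best_id = next_id
                next_id += 1
            used.add(best_id)
            ids.append(best_id)
        prev = dict(zip(ids, masks))
        all_ids.append(ids)
    return all_ids
```

Calling `associate_masks` on a list of per-frame mask lists returns, for each frame, one integer ID per mask. Masks of the same object in neighboring frames share an ID, which is the property the 2D supervision needs. A real tracker also handles occlusion and re-appearance, which this sketch does not.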
3.2 3D Gaussian Rendering and Grouping
As shown in Figure 2(c), this stage brings together all of the core concepts of the paper.
- Identity Encoding
A new parameter, the Identity Encoding, is added to each Gaussian alongside its original parameters $S_{\Theta_{i}}=\{\mathbf{p}_{i},\mathbf{s}_{i},\mathbf{q}_{i},\alpha_{i},\mathbf{c}_{i}\}$. It is a compact vector of length 16. Like the Spherical Harmonic (SH) coefficients that represent color, it is differentiable and learnable.
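As a rough illustration of the data structure this describes, the sketch below extends a per-Gaussian parameter record with a 16-dimensional Identity Encoding. The class name `Gaussian`, the field name `e`, and the small random initialization are assumptions of this sketch rather than details taken from the paper's code. A real implementation would store these as tensors with gradients enabled so the encoding can be learned.

```python
import random
from dataclasses import dataclass, field
from typing import List

ID_DIM = 16  # length of the Identity Encoding, as stated in the text


def _init_identity() -> List[float]:
    # Assumption: small random initialization; the source does not specify one.
    return [random.gauss(0.0, 0.01) for _ in range(ID_DIM)]


@dataclass
class Gaussian:
    p: List[float]   # 3D center position
    s: List[float]   # 3D scale
    q: List[float]   # rotation quaternion
    alpha: float     # opacity
    c: List[float]   # color, represented by SH coefficients
    # New per-Gaussian parameter: the Identity Encoding (hypothetical name `e`).
    e: List[float] = field(default_factory=_init_identity)
```

For example, `Gaussian(p=[0, 0, 0], s=[1, 1, 1], q=[1, 0, 0, 0], alpha=0.5, c=[0.1, 0.2, 0.3])` yields a Gaussian whose `e` has 16 entries, independent of the other parameters.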
- Grouping via Rendering
In the process of rendering labels, similar to $\alpha$