Authors:
(1) Kun Lan, University of Science and Technology of China;
(2) Haoran Li, University of Science and Technology of China;
(3) Haolin Shi, University of Science and Technology of China;
(4) Wenjun Wu, University of Science and Technology of China;
(5) Yong Liao, University of Science and Technology of China;
(6) Lin Wang, AI Thrust, HKUST(GZ);
(7) Pengyuan Zhou, University of Science and Technology of China.
3D Gaussian Splatting. 3D Gaussian Splatting, a recently proposed explicit representation method, has attained remarkable results in three-dimensional scene reconstruction [1]. Its chief advantage is the capability for real-time rendering. Given a set of scene images and the corresponding camera parameters, it represents scene objects with a collection of 3D Gaussians. Each 3D Gaussian is defined by a mean, a covariance matrix, an opacity, and spherical harmonics. The mean locates the Gaussian's center in the 3D scene; the covariance matrix, expressed through a scaling matrix S and a rotation matrix R, describes the Gaussian's size and shape; and the spherical harmonics encode its color. Gaussian Splatting then uses point-based rendering for efficient 3D-to-2D projection.
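To make the parameterization concrete, the minimal sketch below (variable names and the quaternion convention are our assumptions, not the reference implementation) assembles a single Gaussian's covariance from its scale and rotation as R S S^T R^T:

```python
# Minimal sketch of one 3D Gaussian's covariance, Sigma = R S S^T R^T.
import numpy as np

def quaternion_to_rotation(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(scale, quaternion):
    """Build the covariance matrix of one 3D Gaussian.

    scale: (3,) per-axis standard deviations (the diagonal of S).
    quaternion: (4,) rotation as (w, x, y, z).
    """
    R = quaternion_to_rotation(np.asarray(quaternion, dtype=float))
    S = np.diag(scale)
    return R @ S @ S.T @ R.T  # symmetric, positive semi-definite

# Example: an axis-aligned ellipsoid rotated 90 degrees about the z-axis.
sigma = covariance(scale=[0.1, 0.2, 0.3],
                   quaternion=[np.cos(np.pi/4), 0.0, 0.0, np.sin(np.pi/4)])
print(sigma)
```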
Gaussian Splatting has since seen numerous extensions. DreamGaussian [11] and GaussianDreamer [12] combine it with diffusion models [13] to enable text-to-3D generation, while 4D Gaussian Splatting [14] extends it to dynamic scene representation and rendering. Focusing on segmentation, Gaussian Grouping [7] and SAGA [8] have made significant strides. Both employ the Segment Anything Model (SAM) [15] to obtain 2D prior segmentation masks, which guide the learning of additional semantic information attached to the 3D Gaussians. Gaussian Grouping renders this information in the same way as the spherical harmonic coefficients, whereas SAGA uses learnable low-dimensional features. However, because SAM relies on geometric structure, each of its masks carries limited semantic meaning, so both methods propose strategies to keep SAM's segmentation results consistent across viewpoints: Gaussian Grouping treats images from different angles as a sequence of video frames and uses a pre-trained model for mask propagation and matching, while SAGA consolidates consistent, multi-granularity segmentation information across viewpoints through a custom-designed SAM-guidance loss.
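The shared recipe can be sketched as follows. In this toy example (our simplification, not either paper's actual pipeline; the rasterizer is replaced by fixed per-pixel blending weights), each Gaussian carries a learnable identity feature that is blended into a 2D map and supervised with cross-entropy against SAM mask IDs:

```python
# Toy sketch: learn per-Gaussian identity features from 2D SAM masks.
import torch
import torch.nn.functional as F

num_gaussians, feat_dim, num_objects = 1000, 16, 5
H, W = 32, 32

# Learnable per-Gaussian identity features and a small classifier head.
identity = torch.randn(num_gaussians, feat_dim, requires_grad=True)
classifier = torch.nn.Linear(feat_dim, num_objects)

# Stand-in for the rasterizer: per-pixel blending weights over Gaussians
# (in practice these come from alpha compositing during splatting).
weights = torch.rand(H * W, num_gaussians)
weights = weights / weights.sum(dim=1, keepdim=True)

sam_mask = torch.randint(0, num_objects, (H * W,))  # 2D prior from SAM

optimizer = torch.optim.Adam([identity, *classifier.parameters()], lr=1e-2)
for step in range(100):
    feat_map = weights @ identity      # (H*W, feat_dim) rendered features
    logits = classifier(feat_map)      # (H*W, num_objects)
    loss = F.cross_entropy(logits, sam_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```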
3D Segmentation in Radiance Fields. Prior to the advent of 3D Gaussians, NeRF [5] stood as the prominent method for 3D representation, sparking a plethora of derivative works [10, 16–21], several of which focus on decomposing and segmenting NeRF. A notable example is Object NeRF [17], which introduces a dual-pathway neural radiance field for object decomposition. Its scene branch takes spatial coordinates and viewing directions and outputs the density and view-dependent color of a point; it primarily encodes the background of the 3D scene and provides geometric context for the object branch. The object branch, in addition to the spatial and directional inputs, conditions on a learnable object activation code, enabling an independent neural radiance field to be learned for each scene object, while a 3D guard mask mitigates occlusion between objects during training. Similarly, Switch-NeRF [21] decomposes large-scale neural radiance fields through a trainable gating network.
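As a rough schematic of the object branch described above (layer sizes and names are our assumptions, not the authors' architecture), conditioning on a per-object activation code can look like this:

```python
# Schematic object branch: one MLP shared across objects, selected by
# a learnable per-object activation code.
import torch
import torch.nn as nn

class ObjectBranch(nn.Module):
    def __init__(self, pos_dim=63, dir_dim=27, code_dim=64, hidden=128,
                 num_objects=10):
        super().__init__()
        # One learnable activation code per scene object.
        self.codes = nn.Embedding(num_objects, code_dim)
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density = nn.Linear(hidden, 1)
        self.color = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc, obj_id):
        code = self.codes(obj_id)                       # (N, code_dim)
        h = self.trunk(torch.cat([pos_enc, code], -1))  # shared trunk
        sigma = torch.relu(self.density(h))             # per-object density
        rgb = self.color(torch.cat([h, dir_enc], -1))   # view-dependent color
        return sigma, rgb

# Query 4 encoded sample points for object id 2.
branch = ObjectBranch()
sigma, rgb = branch(torch.randn(4, 63), torch.randn(4, 27),
                    torch.full((4,), 2, dtype=torch.long))
```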
DM-NeRF [16] introduces an object field for NeRF segmentation, using it to generate a one-hot vector that indicates which object each spatial point belongs to. SPIn-NeRF [20] employs a semantic radiance field that assesses the likelihood of a scene location being associated with a specific object. ISRF [10] attaches semantic features to points and distills DINO [22] features of rendered images into this framework through a teacher-student model, allowing feature interpolation at any given point; combined with K-means clustering, nearest-neighbor matching, and bilateral search, this enables interactive NeRF segmentation. OR-NeRF [18], by contrast, back-projects 2D segmentation results into 3D space, propagates them across different viewpoints, and re-renders them onto the 2D plane.
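For illustration, the clustering-and-matching step of such an interactive pipeline might look like the toy sketch below (our construction on random features, not the authors' code):

```python
# Toy sketch: K-means on stroke features, then nearest-neighbor matching
# to select every point whose feature lies close to a stroke centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
point_features = rng.normal(size=(5000, 64))   # distilled DINO-like features
stroke_features = point_features[:50]          # features under a user stroke

# Summarize the stroke region with a few K-means centroids.
kmeans = KMeans(n_clusters=4, n_init=10).fit(stroke_features)

# Keep points whose nearest stroke centroid is within a distance threshold.
dists = np.linalg.norm(
    point_features[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=-1)
selected = dists.min(axis=1) < 10.0            # threshold is arbitrary here
print(selected.sum(), "points selected")
```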
These 3D Gaussian and NeRF segmentation methods either require long processing times or struggle to preserve the fine details of the scene in the segmentation results. We therefore propose a method that segments multiple objects in a short time while preserving those detailed features.