Abstract: Understanding dynamic 3D scenes is fundamental for various applications, including extended reality (XR) and autonomous driving. Effectively integrating semantic information into 3D reconstruction enables holistic representation that opens opportunities for immersive and interactive applications. We introduce SADG, Segment Any Dynamic Gaussian Without Object Trackers, a novel approach that combines dynamic Gaussian Splatting representation and semantic information without reliance on object IDs. In contrast to existing works, we do not rely on supervision based on object identities to enable consistent segmentation of dynamic 3D objects. To this end, we propose to learn semantically-aware features by leveraging masks generated from the Segment Anything Model (SAM) and utilizing our novel contrastive learning objective based on hard pixel mining. The learned Gaussian features can be effectively clustered without further post-processing. This enables fast computation for further object-level editing, such as object removal, composition, and style transfer by manipulating the Gaussians in the scene. We further extend several dynamic novel-view datasets with segmentation benchmarks to enable testing of learned feature fields from unseen viewpoints. We evaluate SADG on proposed benchmarks and demonstrate the superior performance of our approach in segmenting objects within dynamic scenes along with its effectiveness for further downstream editing tasks.

What problem does this paper attempt to address?

The problem this paper addresses is how to achieve multi-view-consistent semantic segmentation in dynamic 3D scenes without relying on object trackers. Specifically, the authors propose SADG (Segment Any Dynamic Gaussian Without Object Trackers), which combines a dynamic Gaussian Splatting representation with semantic information to segment dynamic 3D objects consistently.

### Main problems

1. **Semantic segmentation in dynamic 3D scenes**: Existing methods usually rely on object trackers to provide consistent object mask IDs, but these IDs are prone to inconsistency in multi-view scenes, which causes the optimization pipeline to fail or produce sub-optimal results.
2. **Efficient and real-time interactive editing**: Existing methods are computationally intensive on dynamic scenes and need to re-render multiple views to keep edits consistent, which limits their use in real-time interactive tasks.

### Solutions

- **Semantic segmentation without trackers**: SADG avoids object trackers by introducing a new contrastive learning objective and using masks generated by SAM (Segment Anything Model) to learn semantically-aware features.
- **Efficient feature learning and clustering**: SADG uses compact 32-dimensional Gaussian features and clusters them with the DBSCAN algorithm, so the features can be grouped without further post-processing.
- **Extended datasets**: To evaluate the learned feature fields, the authors extend several dynamic novel-view datasets with segmentation benchmarks, so that the model can be tested on unseen views.

### Key contributions

1. **Proposing the SADG framework**: Achieves multi-view-consistent segmentation of dynamic scenes without tracking supervision.
2. **Innovative contrastive learning objective**: Uses hard positive and negative sample mining to learn semantically-aware latent representations from 2D masks.
3. **Extensive experimental verification**: Conducts experiments in single-view and multi-view settings and reports superior performance on five dynamic novel-view benchmarks.
4. **Application in downstream tasks**: Demonstrates the versatility of the learned feature space on editing tasks such as object removal, style transfer, and scene composition.
5. **User-friendly interaction interface**: Provides tools for editing scenes in real time through simple mouse clicks or text prompts.

Through these contributions, SADG addresses the challenges of semantic segmentation in dynamic 3D scenes and provides technical support for real-time interactive applications.
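The clustering step mentioned under Solutions can be illustrated with a minimal sketch. This is not the authors' code: it is a small pure-Python DBSCAN written for illustration, run on toy 3-dimensional vectors that stand in for the 32-dimensional Gaussian features. The function name `dbscan`, the `eps` and `min_samples` values, and the sample data are all assumptions chosen for the example.

```python
import math
from collections import deque


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Brute-force neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is also a core point: keep expanding
    return labels


# Toy per-Gaussian features (hypothetical values, 3-D instead of 32-D).
features = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),  # first object
    (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0),  # second object
    (20.0, 20.0, 20.0),                                  # stray Gaussian
]
labels = dbscan(features, eps=0.5, min_samples=3)

# Object-level editing then reduces to filtering Gaussians by label,
# e.g. "removing" the first object keeps everything not labelled 0.
kept = [f for f, label in zip(features, labels) if label != 0]
```

Because every Gaussian receives a cluster label, an edit such as object removal becomes a simple filter over the Gaussians, which is in line with the summary's point that no re-rendering of multiple views is needed to keep an edit consistent. A production pipeline would more likely use an optimized library implementation of DBSCAN than this brute-force version.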

SADG: Segment Any Dynamic Gaussian Without Object Trackers

SAGD: Boundary-Enhanced Segment Anything in 3D Gaussian via Gaussian Decomposition

SAD: Segment Any RGBD

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

DGD: Dynamic 3D Gaussians Distillation

Segment Any 3D Gaussians

Segment Any 4D Gaussians

GradiSeg: Gradient-Guided Gaussian Segmentation with Enhanced 3D Boundary Precision

SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM

Gradient-Driven 3D Segmentation and Affordance Transfer in Gaussian Splatting Using 2D Masks

SA-GS: Semantic-Aware Gaussian Splatting for Large Scene Reconstruction with Geometry Constrain

EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

Distractor-free Generalizable 3D Gaussian Splatting

SemGauss-SLAM: Dense Semantic Gaussian Splatting SLAM

SLGaussian: Fast Language Gaussian Splatting in Sparse Views

DGS-SLAM: Gaussian Splatting SLAM in Dynamic Environment

Gaga: Group Any Gaussians via 3D-aware Memory Bank

2D-Guided 3D Gaussian Segmentation