Abstract: 3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include: i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation and training strategies. Project page: https://lhj-git.github.io/InstanceGaussian/

What problem does this paper attempt to address?

This paper addresses three key challenges that arise when using 3D Gaussian Splatting (3DGS) for 3D scene understanding:

1. **Imbalance between Appearance and Semantics**: Capturing the fine-grained texture of an object or region requires many Gaussians with different appearance attributes, whereas a single shared semantic attribute is enough to represent the semantics of that region. The number of Gaussians that suits appearance modeling is therefore redundant for semantic representation.
2. **Inconsistency between Appearance and Semantics**: In purely appearance-driven reconstruction, a single Gaussian can span different objects or regions. At object boundaries in particular, one Gaussian may cover both the foreground and the background. Some recent methods (such as language-embedded Gaussian splatting approaches) learn color and semantics in a decoupled way and ignore their interdependence. The resulting inconsistencies make both 3D point segmentation and 2D image segmentation significantly harder.
3. **Difficulty of Top-Down Instance Segmentation**: Previous methods are mostly designed top-down and usually rely on predefined class information. For example, GaussianGrouping defines the number of instances from 2D tracking results, FastLGS determines the number of objects from the number of cross-view matches, and OpenGaussian depends on a predefined number of codebook entries. These methods are prone to over-segmentation or under-segmentation on fine-grained instances in complex scenes, especially when the class distribution is uneven.

To solve these problems, the authors propose InstanceGaussian, a method that jointly learns the appearance and semantic features of objects and adaptively aggregates instances. The main contributions of the paper are:

1. **Semantic-Scaffold-GS Representation**: A new representation that allocates the semantic and appearance attributes of Gaussians more flexibly. It balances appearance and semantics, improving both the accuracy of object boundaries and the quality of the feature representation.
2. **Progressive Appearance-Semantic Joint Training Strategy**: A progressive training strategy that gradually optimizes the joint appearance-semantic representation and keeps the two consistent throughout training. This improves the stability of the model and the accuracy of downstream tasks such as 3D point segmentation and 2D image segmentation.
3. **Bottom-Up, Category-Agnostic Instance Aggregation**: A bottom-up instance aggregation method that does not depend on predefined categories. It combines farthest point sampling with connected component analysis to mitigate over-segmentation and under-segmentation.

With these improvements, InstanceGaussian achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, which supports the effectiveness of the proposed representation and training strategies.
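The bottom-up aggregation idea can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the summary only states that farthest point sampling and connected component analysis are combined, so the concrete pipeline here is an assumption. In this sketch, farthest point sampling picks seeds, each point joins its nearest seed to form deliberately over-segmented clusters, and clusters are then merged when they are spatially close and have similar mean features. The function names and the parameters `num_seeds`, `max_gap`, and `min_sim` are hypothetical.

```python
import math
from collections import defaultdict

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices; each new pick is farthest from all earlier picks."""
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def mean(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.hypot(*a), math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def connected_components(n, edges):
    """Union-find over n nodes; returns one representative id per node."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for a, b in edges:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]

def aggregate_instances(positions, features, num_seeds, max_gap, min_sim):
    """Assumed pipeline: FPS seeds -> nearest-seed clusters -> merge by components."""
    seeds = farthest_point_sampling(positions, num_seeds)
    # Over-segment: every point is owned by its nearest seed.
    owner = [min(range(len(seeds)), key=lambda s: math.dist(p, positions[seeds[s]]))
             for p in positions]
    members = defaultdict(list)
    for i, s in enumerate(owner):
        members[s].append(i)
    centers = {s: mean([positions[i] for i in idx]) for s, idx in members.items()}
    feats = {s: mean([features[i] for i in idx]) for s, idx in members.items()}
    ids = sorted(members)
    # Link two clusters if they are close in space and similar in feature space.
    edges = [(a, b) for n, a in enumerate(ids) for b in ids[n + 1:]
             if math.dist(centers[a], centers[b]) <= max_gap
             and cosine(feats[a], feats[b]) >= min_sim]
    roots = connected_components(len(seeds), edges)
    return [roots[s] for s in owner]

# Toy scene: two objects on a line, each with its own (made-up) semantic feature.
positions = [(float(x), 0.0, 0.0) for x in (0, 1, 2, 3, 10, 11, 12, 13)]
features = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
labels = aggregate_instances(positions, features, num_seeds=4, max_gap=3.0, min_sim=0.9)
print(labels)
```

Note that the number of instances is not specified anywhere in this sketch. `num_seeds` only controls how finely the scene is over-segmented, and the final instance count falls out of the merging step, which is the property the summary attributes to a bottom-up approach.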

InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding

CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding

Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding

GradiSeg: Gradient-Guided Gaussian Segmentation with Enhanced 3D Boundary Precision

SLGaussian: Fast Language Gaussian Splatting in Sparse Views

GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane

SemGauss-SLAM: Dense Semantic Gaussian Splatting SLAM

Occam's LGS: A Simple Approach for Language Gaussian Splatting

SA-GS: Semantic-Aware Gaussian Splatting for Large Scene Reconstruction with Geometry Constrain

SAGD: Boundary-Enhanced Segment Anything in 3D Gaussian via Gaussian Decomposition

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

Query-based Semantic Gaussian Field for Scene Representation in Reinforcement Learning

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding

SparseLGS: Sparse View Language Embedded Gaussian Splatting

3D Vision-Language Gaussian Splatting

LineGS : 3D Line Segment Representation on 3D Gaussian Splatting

LeanGaussian: Breaking Pixel or Point Cloud Correspondence in Modeling 3D Gaussians