InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

Haijie Li, Yanmin Wu, Jiarui Meng, Qiankun Gao, Zhiyao Zhang, Ronggang Wang, Jian Zhang
2024-11-29
Abstract: 3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include: i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation and training strategies. Project page: https://lhj-git.github.io/InstanceGaussian/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three key challenges that arise when using 3D Gaussian Splatting (3DGS) for 3D scene understanding:

1. **Imbalance between Appearance and Semantics**: Capturing the fine-grained texture of an object or region requires many Gaussians with different appearance attributes, yet a single shared semantic attribute is enough for those Gaussians to represent the semantics of that region. The number of Gaussians that appearance requires is therefore redundant for semantics.
2. **Inconsistency between Appearance and Semantics**: In purely appearance-based reconstruction, a single Gaussian can span different objects or regions. At object boundaries in particular, one Gaussian may cover both the foreground and the background of an object. Some recent methods (such as language-embedded Gaussian splatting techniques) adopt a decoupled learning strategy that ignores the interdependence between color and semantics. The resulting inconsistency poses significant challenges for both 3D point segmentation and 2D image segmentation.
3. **Difficulty of Top-Down Instance Segmentation**: Previous methods are mainly top-down and usually rely on predefined class information. For example, GaussianGrouping defines the number of instances from 2D tracking results, FastLGS determines the number of objects from the number of matches in a cross-view setting, and OpenGaussian depends on a predefined number of codebook entries. These methods are prone to over-segmentation or under-segmentation when handling fine-grained instances in complex scenes, especially when the class distribution is uneven.
To solve these problems, the authors propose InstanceGaussian, a method that jointly learns the appearance and semantic features of objects and adaptively aggregates instances. The main contributions of the paper are:

1. **Semantic-Scaffold-GS Representation**: A new representation that allocates the semantic and appearance attributes of Gaussians more flexibly. It balances appearance and semantics, improving both the accuracy of object geometric boundaries and the quality of the feature representation.
2. **Progressive Appearance-Semantic Joint Training Strategy**: A progressive training strategy that gradually optimizes the joint representation of appearance and semantics and keeps the two consistent throughout training. This enhances the stability of the model and the accuracy of downstream tasks such as 3D point segmentation and 2D image segmentation.
3. **Bottom-Up, Category-Agnostic Instance Aggregation**: A bottom-up instance aggregation method that does not depend on predefined categories. It combines farthest point sampling with connected component analysis to address the over-segmentation and under-segmentation problems described above.

With these improvements, InstanceGaussian achieves state-of-the-art performance on category-agnostic, open-vocabulary 3D point-level segmentation, supporting the effectiveness of the proposed representation and training strategies.
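
The bottom-up aggregation idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary only names farthest point sampling and connected component analysis, so the merge rule (cosine similarity of mean features plus centroid distance), the function names `farthest_point_sampling` and `aggregate_instances`, and the thresholds `sim_thresh` and `dist_thresh` are all assumptions made for illustration. The sketch over-segments the points around farthest-point-sampled seeds, then merges fragments that are connected in a similarity graph.

```python
import numpy as np


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices; each new pick is farthest from those chosen so far."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen seed.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(1, min(k, len(points))):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)


def aggregate_instances(positions, features, num_seeds, sim_thresh=0.9, dist_thresh=1.0):
    """Over-segment with FPS seeds, then merge fragments by connected components.

    The merge rule and both thresholds are illustrative assumptions.
    Points are assumed to be distinct, so every seed owns at least itself.
    """
    # Step 1: over-segment by assigning every point to its nearest FPS seed.
    seeds = farthest_point_sampling(positions, num_seeds)
    d_to_seeds = np.linalg.norm(positions[:, None, :] - positions[seeds][None, :, :], axis=2)
    assign = np.argmin(d_to_seeds, axis=1)
    k = len(seeds)

    # Step 2: summarize each fragment by its centroid and unit-length mean feature.
    centroids = np.stack([positions[assign == c].mean(axis=0) for c in range(k)])
    mean_feat = np.stack([features[assign == c].mean(axis=0) for c in range(k)])
    mean_feat = mean_feat / np.linalg.norm(mean_feat, axis=1, keepdims=True)
    sim = mean_feat @ mean_feat.T
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)

    # Step 3: connected components over the fragment graph, via union-find.
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a in range(k):
        for b in range(a + 1, k):
            # Link fragments that look alike semantically and lie close together.
            if sim[a, b] > sim_thresh and dist[a, b] < dist_thresh:
                parent[find(a)] = find(b)

    # Step 4: relabel component roots as 0..M-1 and propagate to the points.
    roots = np.array([find(c) for c in range(k)])
    _, instance_of_fragment = np.unique(roots, return_inverse=True)
    return instance_of_fragment[assign]
```

In this sketch the number of instances is never specified in advance. `num_seeds` only controls how finely the scene is over-segmented, and the final instance count emerges from how many connected components survive the merge. That is the contrast with the top-down methods listed earlier, which fix the count up front.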