Abstract: Zero-shot neural scene segmentation, which reconstructs a 3D neural segmentation field without manual annotations, is an effective approach to scene understanding. However, existing models, especially the efficient 3D Gaussian-based methods, struggle to produce compact segmentation results. This issue stems primarily from the redundant learnable attributes assigned to individual Gaussians, which leave the models without robustness against the 3D inconsistencies in zero-shot generated raw labels. To address this problem, our work, named Compact Segmented 3D Gaussians (CoSegGaussians), proposes the Feature Unprojection and Fusion module as the segmentation field, which uses a shallow decoder, generalizable across all Gaussians, that operates on high-level features. Specifically, leveraging the learned Gaussian geometric parameters, semantic-aware image-based features are introduced into the scene via our unprojection technique. The lifted features, together with spatial information, are fed into the multi-scale aggregation decoder to generate segmentation identities for all Gaussians. Furthermore, we design the CoSeg Loss to boost model robustness against 3D-inconsistent noise. Experimental results show that our model surpasses baselines on the zero-shot semantic segmentation task, improving by ~10% mIoU over the best baseline. Code and more results will be available at https://David-Dou.github.io/CoSegGaussians.

What problem does this paper attempt to address?

This paper addresses the lack of compactness in zero-shot neural scene segmentation. Existing 3D Gaussian-based methods struggle to produce compact segmentation results, mainly because they assign redundant learnable attributes to each individual Gaussian, which leaves them with poor robustness to 3D inconsistencies in the zero-shot generated raw labels.

### Specific description of the problem

1. **Redundant attributes lead to over-fitting**: Existing methods attach a large number of learnable attributes directly to each Gaussian. This makes the model prone to over-fitting the label noise that differs from view to view, especially when the raw labels generated in a zero-shot manner are inconsistent.
2. **3D inconsistency**: Because the labels for the same region may vary across views, existing 3D Gaussian-based methods have difficulty handling this cross-view inconsistency, which degrades both the compactness and the accuracy of the segmentation results.

### Solution

To address these problems, the authors propose a method named CoSegGaussians. Its main contributions are:

1. **Feature Unprojection and Fusion module**: A single shallow decoder, shared by all Gaussians, replaces the per-Gaussian learnable attributes and so reduces redundant parameters. The module takes high-level, semantic-aware image features together with spatial information and generates the segmentation identity of each Gaussian.
2. **Efficient unprojection (back-projection) technique**: An explicit unprojection step lifts semantic-aware features from the 2D images onto the 3D Gaussians using the learned Gaussian geometric parameters. This avoids time-consuming rendering of high-dimensional features and significantly reduces the number of learnable parameters.
3. **CoSeg Loss**: A new loss function is designed to make the model more robust to 3D-inconsistent noise. It combines a pixel-level loss that is robust to label noise with a 3D regularization loss, encouraging compact and consistent segmentation results.

### Experimental results

The experiments show that CoSegGaussians outperforms existing baselines on the zero-shot semantic segmentation task, improving mIoU by about 10% over the best baseline. The method also performs well in inference speed, running about 20 times faster than the NeRF-based methods.

### Summary

By proposing CoSegGaussians, the paper tackles the insufficient compactness of existing 3D Gaussian-based zero-shot neural scene segmentation, improving both the quality of the segmentation results and the efficiency of the model.
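The idea of lifting 2D image features onto 3D Gaussians can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `unproject_features` is hypothetical, and the sketch assumes a pinhole camera, nearest-pixel sampling at each projected Gaussian center, and a plain average over the views in which the center lands inside the image. It ignores occlusion as well as Gaussian opacity and covariance, which a real unprojection would likely take into account.

```python
import numpy as np

def unproject_features(centers, feature_maps, intrinsics, world_to_cams):
    """Average per-view 2D features onto 3D points (hypothetical helper).

    centers:       (N, 3) Gaussian centers in world coordinates
    feature_maps:  list of (H, W, C) feature arrays, one per view
    intrinsics:    list of (3, 3) pinhole matrices K, one per view
    world_to_cams: list of (4, 4) world-to-camera transforms, one per view
    Returns (N, C) averaged features and an (N,) mask of points seen at least once.
    """
    n = centers.shape[0]
    c = feature_maps[0].shape[2]
    accum = np.zeros((n, c))
    counts = np.zeros(n)
    homog = np.concatenate([centers, np.ones((n, 1))], axis=1)  # (N, 4)

    for feats, K, w2c in zip(feature_maps, intrinsics, world_to_cams):
        h, w, _ = feats.shape
        cam = (w2c @ homog.T).T[:, :3]        # points in the camera frame
        z = cam[:, 2]
        in_front = z > 1e-6                   # keep points in front of the camera
        safe_z = np.where(in_front, z, 1.0)   # avoid dividing by zero or negatives
        pix = (K @ cam.T).T                   # homogeneous pixel coordinates
        u = np.round(pix[:, 0] / safe_z).astype(int)
        v = np.round(pix[:, 1] / safe_z).astype(int)
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # Nearest-pixel lookup: rows are indexed by v, columns by u.
        accum[visible] += feats[v[visible], u[visible]]
        counts[visible] += 1

    seen = counts > 0
    accum[seen] /= counts[seen, None]         # uniform average over views
    return accum, seen
```

In this simplified form the lifted features carry no learnable parameters of their own: they are computed once from the images and the Gaussian positions, and only a shared decoder applied afterwards would need training. That mirrors the motivation stated above of avoiding per-Gaussian learnable attributes.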

Learning Segmented 3D Gaussians via Efficient Feature Unprojection for Zero-shot Neural Scene Segmentation

SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

2D-Guided 3D Gaussian Segmentation

Gaga: Group Any Gaussians via 3D-aware Memory Bank

Contrastive Gaussian Clustering: Weakly Supervised 3D Scene Segmentation

TSGaussian: Semantic and Depth-Guided Target-Specific Gaussian Splatting from Sparse Views

InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

ZeroPS: High-quality Cross-modal Knowledge Transfer for Zero-Shot 3D Part Segmentation

GradiSeg: Gradient-Guided Gaussian Segmentation with Enhanced 3D Boundary Precision

A Meaningful Learning Method for Zero-Shot Semantic Segmentation

Delving into Shape-aware Zero-shot Semantic Segmentation

Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding

LeanGaussian: Breaking Pixel or Point Cloud Correspondence in Modeling 3D Gaussians

DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

Learning to Segment Unseen Category Objects Using Gradient Gaussian Attention

Zero-shot Unsupervised Transfer Instance Segmentation

Doubly Deformable Aggregation of Covariance Matrices for Few-shot Segmentation

Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks

Zero-Shot Point Cloud Segmentation by Semantic-Visual Aware Synthesis