Learning Segmented 3D Gaussians via Efficient Feature Unprojection for Zero-shot Neural Scene Segmentation

Bin Dou,Tianyu Zhang,Zhaohui Wang,Yongjia Ma,Zejian Yuan
2024-07-28
Abstract:Zero-shot neural scene segmentation, which reconstructs 3D neural segmentation field without manual annotations, serves as an effective way for scene understanding. However, existing models, especially the efficient 3D Gaussian-based methods, struggle to produce compact segmentation results. This issue stems primarily from their redundant learnable attributes assigned on individual Gaussians, leading to a lack of robustness against the 3D-inconsistencies in zero-shot generated raw labels. To address this problem, our work, named Compact Segmented 3D Gaussians (CoSegGaussians), proposes the Feature Unprojection and Fusion module as the segmentation field, which utilizes a shallow decoder generalizable for all Gaussians based on high-level features. Specifically, leveraging the learned Gaussian geometric parameters, semantic-aware image-based features are introduced into the scene via our unprojection technique. The lifted features, together with spatial information, are fed into the multi-scale aggregation decoder to generate segmentation identities for all Gaussians. Furthermore, we design CoSeg Loss to boost model robustness against 3D-inconsistent noises. Experimental results show that our model surpasses baselines on zero-shot semantic segmentation task, improving by ~10% mIoU over the best baseline. Code and more results will be available at <a class="link-external link-https" href="https://David-Dou.github.io/CoSegGaussians" rel="external noopener nofollow">this https URL</a>.
Computer Vision and Pattern Recognition,Artificial Intelligence
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the lack of compactness in zero - shot neural scene segmentation. Specifically, the existing methods based on 3D Gaussian models perform poorly in generating compact segmentation results, mainly because these methods assign redundant learnable attributes to each Gaussian distribution, resulting in poor robustness to 3D inconsistency. ### Specific description of the problem 1. **Redundant attributes lead to over - fitting**: Existing methods directly assign a large number of learnable attributes to each Gaussian distribution, which makes the model prone to over - fitting to the label noise generated from different views, especially when there is inconsistency in the original labels generated in zero - shot. 2. **3D inconsistency**: Since labels may vary across different views, the existing 3D - Gaussian - based methods have difficulty handling this cross - view inconsistency, thus affecting the compactness and accuracy of the segmentation results. ### Solution To address the above problems, the authors propose a new method named CoSegGaussians. The main innovations of this method include: 1. **Feature Unprojection and Fusion module**: By introducing a shallow decoder to generalize the representation of all Gaussian distributions, redundant parameters are reduced. This module utilizes high - level semantically - aware image features and spatial information to generate the segmentation identity of each Gaussian distribution. 2. **Efficient back - projection technique**: An explicit inverse rendering method is proposed to efficiently introduce semantically - aware features in 2D images into 3D Gaussian representations, avoiding time - consuming high - dimensional feature rendering and significantly reducing the number of learnable parameters. 3. **CoSeg Loss**: A new loss function is designed to enhance the model's robustness to 3D - inconsistent noise. This loss function includes pixel - level label - noise - robustness loss and 3D regularization loss, ensuring the compactness and consistency of the segmentation results. ### Experimental results The experimental results show that CoSegGaussians outperforms the existing baseline methods in the zero - shot semantic segmentation task, with an improvement of about 10% in the mIoU metric. In addition, this method also performs well in terms of inference speed, especially being about 20 times faster than the NeRF - based methods. ### Summary This paper effectively solves the problem of insufficient compactness in the existing 3D - Gaussian - model - based zero - shot neural scene segmentation by proposing the CoSegGaussians method, improving the quality of the segmentation results and the efficiency of the model.