OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies

Runnan Chen,Xiangyu Sun,Zhaoqing Wang,Youquan Liu,Jiepeng Wang,Lingdong Kong,Jiankang Deng,Mingming Gong,Liang Pan,Wenping Wang,Tongliang Liu
2024-12-31
Abstract:Open-vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting the capabilities of open-vocabulary querying within their training scenes so that lacking the generalizability to novel scenes. In this work, we propose \textbf{OVGaussian}, a generalizable \textbf{O}pen-\textbf{V}ocabulary 3D semantic segmentation framework based on the 3D \textbf{Gaussian} representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed \textbf{SegGaussian}, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi-view consistent 2D semantic maps. In the next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train the 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released. (<a class="link-external link-https" href="https://github.com/runnanchen/OVGaussian" rel="external noopener nofollow">this https URL</a>).
Computer Vision and Pattern Recognition,Machine Learning
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the limitations of existing 3D scene understanding methods in open - vocabulary queries. In particular, most of these methods can only perform open - vocabulary queries in specific training scenarios and lack the generalization ability to unseen new scenes. Specifically, current methods usually adopt the "lift and adapt" strategy, extracting features from large - scale 2D visual models and projecting them onto 3D Gaussian representations, which leads to the following problems: 1. **Scene Limitation**: These methods are usually limited to specific training scenes and cannot be well generalized to new, unseen 3D scenes. 2. **Lack of Geometric Context**: 2D projections cannot fully capture 3D spatial relationships, thus affecting accurate 3D spatial understanding. 3. **Insufficient Multimodal Data Integration**: There is a lack of a unified framework to effectively integrate multimodal data and maintain semantic consistency between the 2D and 3D domains. To solve these problems, the paper proposes a new method - OVGaussian (Open - Vocabulary Gaussian), aiming to achieve open - vocabulary segmentation across scenes through 3D Gaussian representations and has the following features: - **SegGaussian Dataset**: A large - scale 3D scene dataset SegGaussian is constructed, which contains 288 3D Gaussian scenes, and each scene is annotated with detailed semantic and instance labels. - **Generalizable Semantic Rasterization (GSR)**: Use 3D neural networks to learn and predict the semantic properties of each 3D Gaussian point, and these semantic properties can be rendered into multi - view - consistent 2D semantic maps. - **Cross - modal Consistency Learning (CCL)**: Use the open - vocabulary annotations in SegGaussian to train 3D neural networks, enabling them to perform open - vocabulary segmentation in Gaussian - based 3D scenes. The CCL framework enhances the open - vocabulary segmentation ability of the model by aligning the semantic properties of 2D images and 3D Gaussians with text embeddings. Experimental results show that OVGaussian significantly outperforms baseline methods in the open - vocabulary segmentation task, demonstrating its strong generalization ability across scenes, across domains, and under new perspectives. In summary, the main contributions of this paper include: - Proposing the SegGaussian dataset, which provides rich 3D Gaussian scenes and their semantic annotations. - Introducing Generalizable Semantic Rasterization (GSR), enabling 3D Gaussian representations to be generalized across scenes. - Designing Cross - modal Consistency Learning (CCL), aligning 3D Gaussians with 2D maps and text embeddings, enhancing the open - vocabulary segmentation ability. - Achieving state - of - the - art performance in the open - vocabulary segmentation task, demonstrating strong generalization ability.