Abstract: Open-vocabulary scene understanding using 3D Gaussian Splatting (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, which restricts open-vocabulary querying to their training scenes and leaves them unable to generalize to novel scenes. In this work, we propose OVGaussian, a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property of each 3D Gaussian point; this semantic property can then be rendered as multi-view consistent 2D semantic maps. Next, we propose a Cross-modal Consistency Learning (CCL) framework that uses the open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train a 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released at https://github.com/runnanchen/OVGaussian.
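
The rendering step described for GSR can be pictured with a small sketch. The abstract does not spell out the blending rule, so this assumes the standard 3DGS front-to-back alpha compositing, with each Gaussian's color replaced by a semantic feature vector. The function name `composite_semantics` and the toy values are hypothetical and are not taken from the OVGaussian code.

```python
import numpy as np

def composite_semantics(features, alphas):
    """Blend per-Gaussian semantic features into one pixel feature.

    Assumption: standard 3DGS front-to-back alpha compositing.
    features: (N, D) semantic vectors of the Gaussians covering the pixel,
              sorted from nearest to farthest.
    alphas:   (N,) effective opacities in [0, 1] at this pixel.
    """
    features = np.asarray(features, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    # Transmittance reaching Gaussian i is prod_{j < i} (1 - alpha_j).
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]
    weights = transmittance * alphas
    return weights @ features, weights

# Toy pixel covered by two Gaussians with 2-D semantic features.
pixel_feature, weights = composite_semantics(
    features=[[1.0, 0.0], [0.0, 1.0]],
    alphas=[0.5, 0.5],
)
```

Because the semantic vector is attached to the Gaussian rather than to any single image, every view that rasterizes the same Gaussians blends the same per-point values, which is one way to read the "multi-view consistent" claim.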

What problem does this paper attempt to address?

This paper addresses a limitation of existing 3D scene understanding methods: most of them support open-vocabulary queries only within the scenes they were trained on and do not generalize to unseen scenes. Current methods usually adopt a "lift and adapt" strategy, extracting features from large-scale 2D vision models and projecting them onto 3D Gaussian representations. This leads to the following problems:

1. **Scene Limitation**: These methods are tied to specific training scenes and do not generalize well to new, unseen 3D scenes.
2. **Lack of Geometric Context**: 2D projections cannot fully capture 3D spatial relationships, which hurts accurate 3D spatial understanding.
3. **Insufficient Multimodal Data Integration**: There is no unified framework that effectively integrates multimodal data and maintains semantic consistency between the 2D and 3D domains.

To solve these problems, the paper proposes OVGaussian (Open-Vocabulary Gaussian), which aims to achieve open-vocabulary segmentation across scenes using 3D Gaussian representations. It has the following components:

- **SegGaussian Dataset**: A large-scale 3D scene dataset containing 288 3D Gaussian scenes, each annotated with detailed semantic and instance labels.
- **Generalizable Semantic Rasterization (GSR)**: A 3D neural network learns and predicts the semantic property of each 3D Gaussian point, and these semantic properties can be rendered into multi-view-consistent 2D semantic maps.
- **Cross-modal Consistency Learning (CCL)**: The open-vocabulary annotations in SegGaussian are used to train the 3D neural network, enabling it to perform open-vocabulary segmentation in Gaussian-based 3D scenes.

The CCL framework improves the model's open-vocabulary segmentation ability by aligning the semantic properties of 2D images and 3D Gaussians with text embeddings. Experimental results show that OVGaussian significantly outperforms baseline methods on the open-vocabulary segmentation task, demonstrating strong generalization across scenes, across domains, and to novel views.

In summary, the main contributions of this paper are:

- Proposing the SegGaussian dataset, which provides rich 3D Gaussian scenes and their semantic annotations.
- Introducing Generalizable Semantic Rasterization (GSR), which enables 3D Gaussian representations to generalize across scenes.
- Designing Cross-modal Consistency Learning (CCL), which aligns 3D Gaussians with 2D maps and text embeddings to improve open-vocabulary segmentation.
- Achieving state-of-the-art performance on the open-vocabulary segmentation task, with strong generalization ability.
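
Once semantic properties live in the same space as text embeddings, an open-vocabulary query reduces to a similarity lookup. The sketch below illustrates that idea only and is not the paper's implementation. It assumes cosine similarity as the matching score and uses toy 3-D vectors in place of real text-encoder outputs. The names `l2_normalize` and `open_vocab_labels` and the class names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def open_vocab_labels(gaussian_feats, text_embeds):
    """Assign each Gaussian the class whose text embedding is most similar.

    gaussian_feats: (N, D) predicted semantic property per Gaussian.
    text_embeds:    (C, D) one embedding per queried class name.
    Returns (labels, sims) with shapes (N,) and (N, C).
    """
    sims = l2_normalize(gaussian_feats) @ l2_normalize(text_embeds).T
    return sims.argmax(axis=1), sims

# Toy stand-ins for text embeddings of two hypothetical classes.
class_names = ["chair", "table"]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gaussian_feats = [[0.9, 0.1, 0.0], [0.2, 2.0, 0.1]]

labels, sims = open_vocab_labels(gaussian_feats, text_embeds)
predicted = [class_names[i] for i in labels]
```

Since the class list is only an input to the query, new categories can be added at inference time by embedding new text prompts, with no change to the per-Gaussian features.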

OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies

OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

SLGaussian: Fast Language Gaussian Splatting in Sparse Views

GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

2D-Guided 3D Gaussian Segmentation

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields

MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

Learning Segmented 3D Gaussians via Efficient Feature Unprojection for Zero-shot Neural Scene Segmentation

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding

Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

3D Vision-Language Gaussian Splatting

Gaga: Group Any Gaussians via 3D-aware Memory Bank

CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Contrastive Gaussian Clustering: Weakly Supervised 3D Scene Segmentation

PixelGaussian: Generalizable 3D Gaussian Reconstruction from Arbitrary Views