GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane

Yansong Qu, Shaohui Dai, Xinyang Li, Jianghang Lin, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

2024-07-27

Abstract: 3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. Our approach includes an efficient compression method that utilizes scene priors to condense noisy high-dimensional semantic features into compact low-dimensional vectors, which are subsequently embedded in 3DGS. During the open-vocabulary querying process, we adopt a distinct approach compared to existing methods, which depend on a manually set fixed empirical threshold to select regions based on their semantic feature distance to the query text embedding. This traditional approach often lacks universal accuracy, leading to challenges in precisely identifying specific target areas. Instead, our method treats the feature selection process as a hyperplane division within the feature space, retaining only those features that are highly relevant to the query. We leverage off-the-shelf 2D Referring Expression Segmentation (RES) models to fine-tune the semantic-space hyperplane, enabling a more precise distinction between target regions and others. This fine-tuning substantially improves the accuracy of open-vocabulary queries, ensuring the precise localization of pertinent 3D Gaussians. Extensive experiments demonstrate GOI's superiority over previous state-of-the-art methods. Our project page is available at https://quyans.github.io/GOI-Hyperplane/.
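The hyperplane view of feature selection described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the logistic loss, the learning rate, and the use of a binary mask as a stand-in for RES model output are all assumptions made for illustration. The sketch shows that a fixed cosine-similarity threshold is one particular hyperplane (normal `w` equal to the unit text embedding, offset `b` equal to minus the threshold), and that `w` and `b` can then be adjusted by gradient descent against a target mask.

```python
import numpy as np

def _unit_rows(x):
    # Normalize each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def threshold_select(features, text_emb, tau):
    # Baseline: keep features whose cosine similarity to the query exceeds tau.
    t = text_emb / np.linalg.norm(text_emb)
    return _unit_rows(features) @ t > tau

def init_from_threshold(text_emb, tau):
    # The fixed threshold rewritten as a hyperplane: w . f + b > 0.
    w = text_emb / np.linalg.norm(text_emb)
    return w, -tau

def hyperplane_select(features, w, b):
    # Keep features on the positive side of the hyperplane (w, b).
    return _unit_rows(features) @ w + b > 0

def bce_loss(features, target_mask, w, b):
    # Mean logistic loss of the hyperplane against a binary target mask.
    logits = _unit_rows(features) @ w + b
    y = target_mask.astype(float)
    return float(np.mean(np.logaddexp(0.0, logits) - y * logits))

def finetune_hyperplane(features, target_mask, w, b, lr=0.5, steps=200):
    # Hypothetical fine-tuning loop: plain gradient descent on the logistic
    # loss. target_mask stands in for a mask produced by a 2D RES model.
    f = _unit_rows(features)
    y = target_mask.astype(float)
    w = w.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(f @ w + b)))
        err = p - y
        w = w - lr * (f.T @ err) / len(y)
        b = b - lr * err.mean()
    return w, b
```

Starting from `init_from_threshold` means the fine-tuned hyperplane begins at the baseline's decision boundary and only moves away from it where the target mask disagrees.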

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses 3D open-vocabulary scene understanding: interpreting and locating specific regions in three-dimensional space from natural language instructions. The authors propose GOI (3D Gaussians of Interest), a framework that combines semantic features from 2D vision-language foundation models with 3D Gaussian Splatting to identify the 3D Gaussians of interest. The key contributions of GOI include:

1. **Innovative Approach**: GOI tackles 3D open-vocabulary scene understanding on top of 3D Gaussian Splatting and introduces an Optimizable Semantic-space Hyperplane (OSH) to select the features most relevant to the query text.
2. **Efficient Feature Compression**: To avoid the computational overhead of embedding high-dimensional semantic features directly into each 3D Gaussian, GOI introduces a Trainable Feature Clustering Codebook (TFCC), which compresses noisy high-dimensional features into low-dimensional vectors while preserving their information.
3. **Improved Feature Selection Strategy**: Methods based on a fixed empirical threshold do not generalize well when selecting relevant features. GOI instead uses an optimizable semantic-space hyperplane to select features more precisely, so that target areas are identified accurately.
4. **Performance Improvement**: In extensive experiments, GOI improved the mean Intersection over Union (mIoU) by 30% on the Mip-NeRF360 dataset and by 12% on the Replica dataset, demonstrating its superiority over existing methods.

In summary, GOI offers an efficient approach to 3D open-vocabulary scene understanding and makes significant progress in accurately locating specific target areas.
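To make the idea of codebook-based compression concrete, here is a minimal sketch that uses plain k-means (Lloyd's algorithm) as a stand-in. It is not the paper's Trainable Feature Clustering Codebook: the function names, the Euclidean distance, and the random initialization are assumptions for illustration, and here each feature is reduced to a single codebook index rather than to a learned low-dimensional vector.

```python
import numpy as np

def assign(features, codebook):
    # Index of the nearest codebook entry (squared Euclidean distance) per feature.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def init_codebook(features, k, seed=0):
    # Hypothetical initialization: pick k distinct features at random.
    rng = np.random.default_rng(seed)
    return features[rng.choice(len(features), size=k, replace=False)].copy()

def fit_codebook(features, codebook, iters=10):
    # Lloyd's algorithm: alternate nearest-entry assignment and mean update.
    codebook = codebook.copy()
    for _ in range(iters):
        idx = assign(features, codebook)
        for j in range(len(codebook)):
            members = features[idx == j]
            if len(members) > 0:  # leave an empty cluster's entry unchanged
                codebook[j] = members.mean(axis=0)
    return codebook

def quantization_error(features, codebook):
    # Mean squared distance between each feature and its codebook entry.
    idx = assign(features, codebook)
    return float(((features - codebook[idx]) ** 2).sum(axis=1).mean())
```

With a codebook of `k` entries, each feature can be stored as one integer index and recovered approximately as `codebook[assign(features, codebook)]`. This trades reconstruction accuracy for a much smaller per-element footprint.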

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding

Semantic-oriented 3D model classification and retrieval using Gaussian processes

InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

Occam's LGS: A Simple Approach for Language Gaussian Splatting

TSGaussian: Semantic and Depth-Guided Target-Specific Gaussian Splatting from Sparse Views

SLGaussian: Fast Language Gaussian Splatting in Sparse Views

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians

GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields

GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting

GS3LAM: Gaussian Semantic Splatting SLAM

LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding

3D Vision-Language Gaussian Splatting

HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes

OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding

SparseLGS: Sparse View Language Embedded Gaussian Splatting

Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding