GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane

Yansong Qu, Shaohui Dai, Xinyang Li, Jianghang Lin, Liujuan Cao, Shengchuan Zhang, Rongrong Ji
2024-07-27
Abstract: 3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. Our approach includes an efficient compression method that utilizes scene priors to condense noisy high-dimensional semantic features into compact low-dimensional vectors, which are subsequently embedded in 3DGS. During the open-vocabulary querying process, we adopt a distinct approach compared to existing methods, which depend on a manually set fixed empirical threshold to select regions based on their semantic feature distance to the query text embedding. This traditional approach often lacks universal accuracy, leading to challenges in precisely identifying specific target areas. Instead, our method treats the feature selection process as a hyperplane division within the feature space, retaining only those features that are highly relevant to the query. We leverage off-the-shelf 2D Referring Expression Segmentation (RES) models to fine-tune the semantic-space hyperplane, enabling a more precise distinction between target regions and others. This fine-tuning substantially improves the accuracy of open-vocabulary queries, ensuring the precise localization of pertinent 3D Gaussians. Extensive experiments demonstrate GOI's superiority over previous state-of-the-art methods. Our project page is available at https://quyans.github.io/GOI-Hyperplane/.
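The contrast the abstract draws between a fixed threshold and a fine-tuned hyperplane can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`init_hyperplane`, `select`, `fine_tune`, `demo`), the synthetic features, and the logistic-regression update are assumptions made for illustration, and the binary `target_mask` merely stands in for a mask that a 2D RES model might supply. The hyperplane starts with the query embedding as its normal, which behaves like a similarity threshold, and is then adjusted so that features labeled as target fall on its positive side.

```python
import numpy as np


def init_hyperplane(query_embedding, bias=0.0):
    """Start from a hyperplane whose normal is the unit-length query embedding."""
    w = query_embedding / np.linalg.norm(query_embedding)
    return w, bias


def select(features, w, b):
    """Keep the features on the positive side of the hyperplane w.f + b = 0."""
    return features @ w + b > 0


def fine_tune(features, target_mask, w, b, lr=0.5, steps=200):
    """Adjust (w, b) by gradient descent on binary cross-entropy.

    target_mask is a hypothetical stand-in for a pseudo-label mask,
    with one boolean per feature vector.
    """
    y = target_mask.astype(np.float64)
    for _ in range(steps):
        logits = features @ w + b
        p = 1.0 / (1.0 + np.exp(-logits))   # sigmoid
        grad = p - y                        # d(BCE)/d(logits)
        w = w - lr * (features.T @ grad) / len(y)
        b = b - lr * grad.mean()
    return w, b


def demo(n=100, dim=8, seed=0):
    """Synthetic example: both clusters look similar to the query at first."""
    rng = np.random.default_rng(seed)
    target_center = np.zeros(dim)
    target_center[0] = 1.0
    other_center = np.zeros(dim)
    other_center[0], other_center[1] = 0.6, 0.8   # still positively aligned with the query
    target = target_center + 0.05 * rng.standard_normal((n, dim))
    other = other_center + 0.05 * rng.standard_normal((n, dim))
    features = np.vstack([target, other])
    mask = np.concatenate([np.ones(n, dtype=bool), np.zeros(n, dtype=bool)])

    w, b = init_hyperplane(target_center)
    acc_before = float((select(features, w, b) == mask).mean())
    w, b = fine_tune(features, mask, w, b)
    acc_after = float((select(features, w, b) == mask).mean())
    return acc_before, acc_after


if __name__ == "__main__":
    print(demo())
```

In this toy setup the initial hyperplane selects both clusters because both have positive similarity to the query; the fine-tuned hyperplane tilts and shifts until only the target cluster remains on the positive side.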
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses 3D open-vocabulary scene understanding: interpreting and locating specific regions in three-dimensional space from natural language instructions. The authors propose a framework called GOI (3D Gaussians of Interest), which combines semantic features from 2D vision-language foundation models with 3D Gaussian Splatting to identify the 3D Gaussians relevant to a query. The key contributions of GOI include:
1. **Innovative Approach**: GOI tackles 3D open-vocabulary scene understanding on top of 3D Gaussian Splatting and introduces an Optimizable Semantic-space Hyperplane (OSH) to select the features most relevant to the query text.
2. **Efficient Feature Compression**: To avoid the computational overhead of embedding high-dimensional semantic features directly into each 3D Gaussian, GOI introduces a Trainable Feature Clustering Codebook (TFCC), which compresses noisy high-dimensional features into low-dimensional vectors while preserving their information.
3. **Improved Feature Selection Strategy**: Methods that rely on a fixed empirical threshold lack universal accuracy when selecting relevant features. GOI instead uses an optimizable semantic-space hyperplane to select features more precisely, so that target areas are identified accurately.
4. **Performance Improvement**: In extensive experiments, GOI improved the mean Intersection over Union (mIoU) by 30% on the Mip-NeRF360 dataset and by 12% on the Replica dataset, demonstrating its superiority over existing methods.
In summary, GOI offers an efficient approach to 3D open-vocabulary scene understanding and makes significant progress in accurately locating specific target areas.
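The general idea behind codebook-based feature compression, mentioned above, can be sketched with plain k-means vector quantization. This is a simplified stand-in and not the paper's Trainable Feature Clustering Codebook: the helper names (`fit_codebook`, `assign`), the number of codes, and the random features are all assumptions for illustration. Each high-dimensional feature is replaced by the index of its nearest codebook entry, so a scene stores one small codebook plus one compact code per feature.

```python
import numpy as np


def assign(features, codebook):
    """Return, for each feature, the index of the nearest codebook entry."""
    # Squared Euclidean distances, shape (num_features, num_codes).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_codebook(features, num_codes, iters=10, seed=0):
    """Fit a codebook with a few Lloyd (k-means) iterations."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=num_codes, replace=False)
    codebook = features[start].copy()
    for _ in range(iters):
        idx = assign(features, codebook)
        for k in range(num_codes):
            members = features[idx == k]
            if len(members):            # leave empty clusters unchanged
                codebook[k] = members.mean(axis=0)
    return codebook


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical 64-D "semantic features" drawn around three cluster centers.
    centers = 5.0 * rng.standard_normal((3, 64))
    feats = np.vstack([c + rng.standard_normal((100, 64)) for c in centers])

    codebook = fit_codebook(feats, num_codes=3)
    idx = assign(feats, codebook)       # one small integer per feature
    recon = codebook[idx]               # decoded high-dimensional features
    print(codebook.shape, idx.shape, float(((feats - recon) ** 2).mean()))
```

Storing `idx` instead of the full feature trades a small reconstruction error for a large reduction in per-feature storage. A trainable codebook, as the summary describes, would additionally learn its entries jointly with the scene rather than fixing them with k-means.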