Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

Jun Guo,Xiaojian Ma,Yue Fan,Huaping Liu,Qing Li
2024-08-23
Abstract:Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Existing methods adopt neurel rendering methods as 3D representations and jointly optimize color and semantic features to achieve rendering and scene understanding simultaneously. In this paper, we introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Our key idea is to distill knowledge from 2D pre-trained models to 3D Gaussians. Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, which is based on spatial relationship and need no additional training. We further build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference. The quantitative results on ScanNet segmentation and LERF object localization demonstates the superior performance of our method. Additionally, we explore several applications of Semantic Gaussians including object part segmentation, instance segmentation, scene editing, and spatiotemporal segmentation with better qualitative results over 2D and 3D baselines, highlighting its versatility and effectiveness on supporting diverse downstream tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address the problem of open-vocabulary 3D scene understanding. Specifically, the research goal is to understand and interpret information within 3D scenes given a scenario, rather than being limited to predefined object categories. Open-vocabulary 3D scene understanding allows machines to interact with the environment in natural language, enabling tasks such as object recognition, semantic scene reconstruction, and navigation in complex environments. To achieve this goal, the researchers propose the Semantic Gaussians method, a novel open-vocabulary scene understanding approach based on 3D Gaussian lattices. Unlike existing methods, this approach designs a flexible projection method that maps various 2D semantic features from a pre-trained image encoder onto new semantic components of the 3D Gaussian lattice, without additional training. Furthermore, the paper constructs a 3D semantic network that can directly predict semantic components from the original 3D Gaussian lattice, enabling fast inference. Quantitative results on the ScanNet segmentation and LERF object localization datasets demonstrate the superior performance of this method. Additionally, the paper explores the performance of Semantic Gaussians in multiple application areas, including object part segmentation, instance segmentation, scene editing, and spatiotemporal segmentation, showcasing its versatility and effectiveness.