LangSplat: 3D Language Gaussian Splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, Hanspeter Pfister
2024-03-31
Abstract: Humans live in a 3D world and commonly use natural language to interact with a 3D scene. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat advances the field by utilizing a collection of 3D Gaussians, each encoding language features distilled from CLIP, to represent the language field. By employing a tile-based splatting technique for rendering language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features on the scene-specific latent space, thereby alleviating substantial memory demands imposed by explicit modeling. Existing methods struggle with imprecise and vague 3D language fields, which fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM, thereby eliminating the need for extensively querying the language field across various scales and the regularization of DINO features. Extensive experimental results show that LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin. Notably, LangSplat is extremely efficient, achieving a 199$\times$ speedup compared to LERF at the resolution of 1440 $\times$ 1080. We strongly recommend readers to check out our video results at this https URL
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses precise and efficient natural-language querying of three-dimensional (3D) scenes. The research team proposes a method called LangSplat that aims to overcome the speed and accuracy limitations of existing approaches. Existing methods (such as LERF) typically represent 3D scenes with Neural Radiance Fields (NeRF) and construct 3D language fields from features extracted by pre-trained vision-language models (e.g., CLIP). However, these methods face challenges when dealing with large-scale, diverse 3D scene data, especially in the absence of corresponding language annotations. Moreover, although NeRF is a powerful 3D representation, its volume rendering is computationally expensive, which limits efficiency in practical applications.

To address these issues, LangSplat adopts the following strategies:

1. **3D Gaussian Splatting**: LangSplat represents the 3D scene with 3D Gaussian Splatting. This enables efficient rendering and achieves real-time rendering at higher resolutions.
2. **Scene-Specific Autoencoder**: To alleviate the memory demands of explicit modeling, LangSplat introduces a scene-specific autoencoder that compresses CLIP embeddings into a low-dimensional latent space, reducing the memory required for the language feature stored on each 3D Gaussian.
3. **Learning Hierarchical Semantics**: To handle point ambiguity, where one 3D point can belong to semantic regions at several scales, LangSplat leverages the Segment Anything Model (SAM) to define and learn hierarchical semantics at different scales, so that each 3D point obtains precise semantic information.

Experimental results show that LangSplat significantly outperforms previous methods in 3D open-vocabulary querying, particularly in 3D object localization and semantic segmentation.
Notably, LangSplat is 199 times faster than LERF, the previous state-of-the-art method, at a resolution of 1440 × 1080. In summary, by combining 3D Gaussian Splatting with a scene-specific autoencoder and using SAM to learn hierarchical semantics, LangSplat addresses the speed and accuracy issues present in 3D language field modeling.
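
The memory argument behind the scene-specific autoencoder can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only and is not taken from the authors' code. It assumes 512-dimensional CLIP features, a 3-dimensional latent space, three SAM semantic levels, float32 storage, and a hypothetical scene of one million Gaussians; none of these numbers appear in the summary above, and the helper name `language_feature_bytes` is made up for this example.

```python
# Back-of-the-envelope memory estimate for per-Gaussian language features.
# Every constant below is an illustrative assumption, not a value stated in the text.

BYTES_PER_FLOAT32 = 4


def language_feature_bytes(num_gaussians: int, feature_dim: int, num_levels: int) -> int:
    """Bytes needed to store one float32 feature vector per Gaussian per semantic level."""
    return num_gaussians * num_levels * feature_dim * BYTES_PER_FLOAT32


NUM_GAUSSIANS = 1_000_000  # hypothetical scene size
NUM_LEVELS = 3             # assumed number of SAM semantic levels
CLIP_DIM = 512             # assumed CLIP embedding size
LATENT_DIM = 3             # assumed autoencoder latent size

# Storing raw CLIP embeddings on every Gaussian versus storing compressed latents.
raw_bytes = language_feature_bytes(NUM_GAUSSIANS, CLIP_DIM, NUM_LEVELS)
latent_bytes = language_feature_bytes(NUM_GAUSSIANS, LATENT_DIM, NUM_LEVELS)

print(f"raw CLIP features: {raw_bytes / 1e9:.3f} GB")
print(f"latent features:   {latent_bytes / 1e6:.1f} MB")
print(f"reduction factor:  {raw_bytes / latent_bytes:.1f}x")
```

Under these assumptions the raw features would take about 6.1 GB while the latent features take 36 MB, a reduction by the ratio of the two feature dimensions (512 / 3). The decoder half of the autoencoder is then needed only at query time, to map rendered latent features back to CLIP space.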