LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering

Simon Boeder, Fabian Gigengack, Benjamin Risse
2024-07-25
Abstract: The 3D occupancy estimation task has recently become an important challenge in the area of vision-based autonomous driving. However, most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, limiting their practicality and scalability. Moreover, most methods are tied to a predefined set of classes which they can detect. In this work, we present a novel approach for open vocabulary occupancy estimation called LangOcc, which is trained only via camera images and can detect arbitrary semantics via vision-language alignment. In particular, we distill the knowledge of the strong vision-language aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering our estimations back to 2D space, where ground-truth features can be computed. This training mechanism automatically supervises the scene geometry, allowing for a straightforward and powerful training method without any explicit geometry supervision. LangOcc outperforms LiDAR-supervised competitors in open vocabulary occupancy by a large margin, solely relying on vision-based training. We also achieve state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset, despite not being limited to a specific set of categories, thus demonstrating the effectiveness of our proposed vision-language training.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses several key challenges in 3D occupancy estimation for autonomous driving:

1. **Dependence on expensive 3D labels**: Most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, which limits their practicality and scalability. Obtaining large-scale, high-quality 3D labels is resource-intensive.
2. **Limitation to predefined categories**: Most existing 3D occupancy estimation methods can only detect a fixed, predefined set of categories, which restricts their usefulness in complex and dynamic environments. Overcoming this requires a method that can handle arbitrary semantics, that is, open-vocabulary occupancy estimation.
3. **Lack of self-supervised learning methods**: Current methods usually require a large amount of labeled data for supervised learning. A self-supervised method can instead be trained from image data alone, which removes the dependence on 3D labels and on explicit geometric supervision.

### Overview of the solution

To solve these problems, the authors propose a new method named LangOcc, with the following main features:

- **Open-vocabulary occupancy estimation**: LangOcc distills the knowledge of the vision-language aligned encoder CLIP into a 3D occupancy model. As a result, it can detect arbitrary semantics rather than only predefined categories.
- **Self-supervised learning**: The model's 3D feature predictions are rendered back into 2D image space via differentiable volume rendering, where ground-truth features can be computed from the images. This training mechanism supervises the scene geometry automatically, without any explicit geometric supervision.
- **Vision-language aligned features**: Instead of predicting probabilities over predefined categories, the model estimates a vision-language aligned feature vector for each voxel. This lets it represent both the geometry and the semantics of the 3D scene.
- **Efficient training and inference**: By introducing feature subspace learning, the computational and memory overhead can be reduced for specific tasks while maintaining good performance.

In summary, LangOcc is a self-supervised method that trains 3D occupancy estimation models using only image data and handles arbitrary semantics in an open vocabulary. The authors report that it outperforms LiDAR-supervised competitors in open-vocabulary occupancy by a large margin and achieves state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset.
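
The rendering-based supervision and the open-vocabulary query summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`render_weights`, `render_feature`, `cosine_loss`, `classify_voxels`), the use of a cosine loss, and the single-ray setup are all hypothetical choices made for clarity. The sketch uses the standard volume rendering weights, where each sample along a camera ray contributes in proportion to its opacity and to the transmittance accumulated before it.

```python
import numpy as np


def render_weights(sigma, delta):
    """Standard volume-rendering weights for the samples along one ray.

    sigma: (N,) non-negative densities at the samples (assumed model output).
    delta: (N,) distances between consecutive samples.
    """
    alpha = 1.0 - np.exp(-sigma * delta)  # opacity of each sample
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    return trans * alpha


def render_feature(sigma, delta, feats):
    """Render per-sample feature vectors (N, D) into one 2D feature (D,)."""
    return render_weights(sigma, delta) @ feats


def cosine_loss(pred, target, eps=1e-8):
    """1 - cosine similarity; an assumed choice of feature loss."""
    cos = pred @ target / (np.linalg.norm(pred) * np.linalg.norm(target) + eps)
    return 1.0 - cos


def classify_voxels(voxel_feats, text_embeds, eps=1e-8):
    """Assign each voxel feature (V, D) to the most similar text embedding (C, D)."""
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=-1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    return np.argmax(v @ t.T, axis=-1)
```

Because the rendered feature depends on the densities through the weights, a loss on the rendered 2D feature also pushes density toward the samples whose features match the image, which is the sense in which geometry is supervised implicitly. At query time, `text_embeds` would hold the embeddings of arbitrary class prompts from a vision-language text encoder, so the set of categories is not fixed at training time.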