Abstract: The 3D occupancy estimation task has recently become an important challenge in vision-based autonomous driving. However, most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, limiting their practicality and scalability. Moreover, most methods are tied to a predefined set of classes that they can detect. In this work, we present a novel approach for open vocabulary occupancy estimation called LangOcc, which is trained only via camera images and can detect arbitrary semantics via vision-language alignment. In particular, we distill the knowledge of the strong vision-language aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering our estimations back to 2D space, where ground-truth features can be computed. This training mechanism automatically supervises the scene geometry, allowing for a straightforward and powerful training method without any explicit geometry supervision. LangOcc outperforms LiDAR-supervised competitors in open vocabulary occupancy by a large margin, solely relying on vision-based training. We also achieve state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset, despite not being limited to a specific set of categories, thus demonstrating the effectiveness of our proposed vision-language training.
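
The rendering-based supervision described in the abstract can be illustrated with a minimal sketch of standard volume rendering applied to feature vectors along a single camera ray. This is not the authors' implementation: the function names `render_ray_feature` and `cosine_loss`, the toy densities and 2-dimensional features, and the choice of a cosine loss are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def render_ray_feature(densities, features, deltas):
    """Alpha-composite per-sample features along one ray.

    densities: non-negative density at each sample (assumed model output)
    features:  one feature vector per sample (assumed model output)
    deltas:    distance covered by each sample along the ray
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # probability that the ray reaches the current sample
    for sigma, feat, delta in zip(densities, features, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(dim):
            rendered[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return rendered, weights


def cosine_loss(pred, target):
    """1 - cosine similarity (hypothetical loss choice for this sketch)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm + 1e-8)


# Toy ray: empty space first, then an opaque surface carrying feature [0, 1].
densities = [0.0, 50.0, 50.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
deltas = [0.5, 0.5, 0.5]

rendered, weights = render_ray_feature(densities, features, deltas)
target_2d_feature = [0.0, 1.0]  # stands in for a 2D feature from an image encoder
loss = cosine_loss(rendered, target_2d_feature)
```

Because the rendered feature only matches the 2D target when the density places the weight on the correct sample, minimizing such a loss also shapes the geometry, which is the intuition behind training without explicit geometry supervision.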

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses several key challenges in 3D occupancy estimation for autonomous driving:

1. **Dependence on expensive 3D labels**: Most existing camera-based methods rely on expensive 3D voxel labels or LiDAR scans for training, which limits their practicality and scalability. Obtaining large-scale, high-quality 3D labels is resource-intensive and often impractical.
2. **Limitation to predefined categories**: Most existing 3D occupancy estimation methods can only detect a predefined set of categories, which limits their applicability in complex and dynamic environments. Overcoming this requires a method that can handle arbitrary semantics, that is, open-vocabulary occupancy estimation.
3. **Lack of self-supervised learning methods**: Current methods usually require large amounts of labeled data for supervised learning. Self-supervised methods can instead be trained using only image data, avoiding the dependence on 3D labels and on explicit geometric supervision.

### Overview of the solution

To solve the above problems, the authors propose a new method named LangOcc, which has the following main features:

- **Open-vocabulary occupancy estimation**: LangOcc distills the knowledge of the vision-language aligned encoder CLIP into the 3D occupancy model. As a result, it can detect arbitrary semantics rather than only predefined categories.
- **Self-supervised learning**: The model's 3D feature predictions are supervised in 2D image space through differentiable volume rendering. This automatically supervises the scene geometry without explicit geometric supervision.
- **Vision-language aligned features**: The model estimates a vision-language aligned feature vector for each voxel instead of predicting probabilities over predefined categories. In this way, it can represent arbitrary geometry and semantics in 3D space.
- **Efficient training and inference**: By introducing feature subspace learning, the computational and memory overhead can be reduced for specific tasks while maintaining good performance.

In summary, LangOcc is a self-supervised method that trains 3D occupancy estimation models using only image data and handles arbitrary semantics in an open vocabulary, improving the perception and scene understanding of autonomous driving systems in complex environments.
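
To make the open-vocabulary idea concrete, the following minimal sketch assigns each voxel the text prompt whose embedding is most similar to the voxel's feature vector. It is an illustration under stated assumptions, not the paper's inference code: `label_voxels` is a hypothetical helper, and the 3-dimensional toy vectors stand in for real CLIP text embeddings and predicted voxel features.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def label_voxels(voxel_features, text_embeddings):
    """Pick, for each voxel, the prompt with the highest cosine similarity.

    voxel_features:  list of per-voxel feature vectors (assumed model output)
    text_embeddings: dict mapping a free-form prompt to its embedding
                     (assumed to live in the same feature space)
    """
    prompts = list(text_embeddings)
    text = [normalize(text_embeddings[p]) for p in prompts]
    labels = []
    for feat in voxel_features:
        f = normalize(feat)
        sims = [sum(a * b for a, b in zip(f, t)) for t in text]
        labels.append(prompts[sims.index(max(sims))])
    return labels


# Toy data: the prompt set can be changed at query time without retraining.
text_embeddings = {"car": [1.0, 0.0, 0.0], "tree": [0.0, 1.0, 0.0]}
voxel_features = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

labels = label_voxels(voxel_features, text_embeddings)
```

Since the class list is only a set of prompts supplied at query time, the same voxel features can be reused with a different vocabulary, which is what distinguishes this setup from predicting probabilities over a fixed category set.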

LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering
