Abstract: We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: zero-shot 3D semantic segmentation using existing datasets; and 3D grounding and retrieval of free-form language queries, using a small dataset that we propose as an extension of nuScenes.
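To make the idea of a "dense voxel map of 3D grounded language embeddings" concrete, here is a minimal sketch of how such a map could be queried for zero-shot segmentation. It is not the authors' code: the function names, the toy 3-dimensional embeddings, and the boolean occupancy mask are all assumptions made for illustration. In practice the text embeddings would come from the text encoder of a pre-trained vision-language model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def label_voxels(voxel_feats, occupied, text_feats):
    """Give each occupied voxel the index of its most similar text query.

    voxel_feats: per-voxel embeddings (hypothetical 3D-language head output)
    occupied:    per-voxel booleans (hypothetical thresholded occupancy output)
    text_feats:  one embedding per free-form text query
    Free voxels are labelled None.
    """
    labels = []
    for feat, is_occupied in zip(voxel_feats, occupied):
        if not is_occupied:
            labels.append(None)
            continue
        sims = [cosine(feat, text) for text in text_feats]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy data: two queries (say "car" and "tree") and three voxels.
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
voxel_feats = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.5, 0.5, 0.0]]
occupied = [True, True, False]
labels = label_voxels(voxel_feats, occupied, text_feats)
```

Because the class set is just a list of text embeddings, changing the vocabulary only means changing `text_feats`; nothing about the voxel features has to be recomputed.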

What problem does this paper attempt to address?

This paper addresses **predicting open-vocabulary 3D semantic voxel occupancy maps from 2D images**, so that a 3D scene can be semantically segmented, grounded, and searched with natural-language queries. Specifically, the paper aims to:

1. **Handle 2D-to-3D ambiguity**: Cameras and LiDAR sensors capture only visible surfaces, and many different 3D structures project to the same 2D image, so predicting 3D structure from 2D images is inherently uncertain.
2. **Deal with open-vocabulary tasks**: Traditional 3D semantic segmentation methods usually rely on a predefined set of categories (i.e., a closed vocabulary), while this paper aims to support open-vocabulary tasks, that is, recognizing categories that were not labelled during training.
3. **Reduce the dependence on manually annotated data**: Obtaining 3D annotations is difficult and costly, so the paper uses self-supervised learning to avoid the need for manual 3D language annotations.

To solve these problems, the authors propose the following:

- **New model architecture**: An architecture with a 2D-3D encoder, an occupancy prediction head and a 3D-language head, which predicts 3D semantic voxel occupancy maps from 2D images and produces text-aligned features.
- **Tri-modal self-supervised learning algorithm**: Training uses three modalities, namely images, language and LiDAR point clouds, without any manual 3D language annotations.
- **Open-vocabulary 3D semantic segmentation**: By mapping the pixel-level features of MaskCLIP+ to LiDAR point clouds during training, the model can perform zero-shot 3D semantic segmentation and language-based 3D localization at inference.

Together, these yield a model named POP-3D, which predicts 3D semantic occupancy from images as input and is trained without manually annotated 3D language data. LiDAR point clouds serve as a training modality rather than as an inference-time input.
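The mapping of pixel-level features to LiDAR points mentioned above can be sketched as a camera projection followed by a feature lookup. This is a simplified illustration under stated assumptions, not the paper's pipeline: it assumes points already expressed in the camera frame, an ideal pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and a nearest-pixel lookup into a per-pixel feature map such as one produced by a 2D vision-language model.

```python
import math


def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates.

    Returns None for points at or behind the camera plane (z <= 0).
    """
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def gather_point_targets(points, feature_map, fx, fy, cx, cy):
    """Pair each LiDAR point that lands inside the image with a pixel feature.

    feature_map is indexed as feature_map[row][col] and holds one feature
    vector per pixel. Points that project outside the image are skipped.
    """
    height, width = len(feature_map), len(feature_map[0])
    targets = []
    for point in points:
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None:
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        if 0 <= col < width and 0 <= row < height:
            targets.append((point, feature_map[row][col]))
    return targets


# Toy 2x2 feature map with 2-dimensional features per pixel.
feature_map = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.2, 0.8]],
]
points = [
    (0.0, 0.0, 1.0),    # projects to pixel (row 1, col 1)
    (-0.5, -0.5, 1.0),  # projects to pixel (row 0, col 0)
    (0.0, 0.0, -1.0),   # behind the camera, skipped
    (5.0, 0.0, 1.0),    # outside the image, skipped
]
targets = gather_point_targets(points, feature_map, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
```

The resulting (point, feature) pairs are the kind of supervision a 3D-language head could be trained against at the voxels containing those points, which is how language annotations in 3D can be avoided.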

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

OVO: Open-Vocabulary Occupancy

LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering

OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation

VEON: Vocabulary-Enhanced Occupancy Prediction

Language Driven Occupancy Prediction

OpenScene: 3D Scene Understanding with Open Vocabularies

InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction

Learning Occupancy for Monocular 3D Object Detection

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

VPOcc: Exploiting Vanishing Point for Monocular 3D Semantic Occupancy Prediction

LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba

OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving

PLA: Language-Driven Open-Vocabulary 3D Scene Understanding

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

OPUS: Occupancy Prediction Using a Sparse Set