Abstract: Open-vocabulary 3D object detection (OV-3Det) aims to generalize beyond the limited number of base categories labeled during the training phase. The biggest bottleneck is the scarcity of annotated 3D data, whereas 2D image datasets are abundant and richly annotated. Consequently, it is intuitive to leverage the wealth of annotations in 2D images to alleviate the inherent data scarcity in OV-3Det. In this paper, we push the task setup to its limits by exploring the potential of using solely 2D images to learn OV-3Det. The major challenge for this setup is the modality gap between training images and testing point clouds, which prevents effective integration of 2D knowledge into OV-3Det. To address this challenge, we propose a novel framework, ImOV3D, that leverages a pseudo-multimodal representation containing both images and point clouds (PC) to close the modality gap. The key to ImOV3D lies in flexible modality conversion: 2D images can be lifted into 3D using monocular depth estimation, and images can also be derived from 3D scenes through rendering. This allows both training images and testing point clouds to be unified into a common image-PC representation that encompasses a wealth of 2D semantic information while also incorporating the depth and structural characteristics of 3D spatial data. We carefully conduct this conversion to minimize the domain gap between training and test cases. Extensive experiments on two benchmark datasets, SUNRGBD and ScanNet, show that ImOV3D significantly outperforms existing methods, even in the absence of ground truth 3D training data. With the inclusion of a minimal amount of real 3D data for fine-tuning, the performance also significantly surpasses the previous state of the art. Code and pre-trained models are released at https://github.com/yangtiming/ImOV3D.
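The modality conversion described in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics and a depth map that is already metric; neither assumption comes from the abstract, and the function names `lift_depth_to_points` and `project_points_to_pixels` are hypothetical rather than taken from the ImOV3D code. The first function lifts a depth map into camera-frame 3D points, and the second maps 3D points back to pixel coordinates, which is only the geometric core of rendering and not the paper's rendering procedure.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def lift_depth_to_points(
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Point3D]:
    """Back-project a depth map into camera-frame 3D points.

    Assumes a pinhole camera (focal lengths fx, fy and principal
    point cx, cy in pixels) and depth values in metric units.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, z in enumerate(row):        # u: pixel column index
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def project_points_to_pixels(
    points: Sequence[Point3D],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Pixel]:
    """Project camera-frame 3D points back to pixel coordinates.

    This is the inverse of the lifting step for points with z > 0.
    """
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points]


if __name__ == "__main__":
    # Toy 2x2 depth map; a zero marks a pixel with no depth estimate.
    toy_depth = [[2.0, 0.0], [1.0, 4.0]]
    pts = lift_depth_to_points(toy_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
    print(pts)
    print(project_points_to_pixels(pts, fx=2.0, fy=2.0, cx=0.5, cy=0.5))
```

In practice the depth map would come from a monocular depth estimator and the intrinsics would be known or estimated; the toy values above are only there to make the round trip between the two modalities easy to follow.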

ImOV3D: Learning Open-Vocabulary Point Clouds 3D Object Detection from Only 2D Images

Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation

Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

OCM3D: Object-Centric Monocular 3D Object Detection

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

3D Unsupervised Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Open-Set 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data

OBMO: One Bounding Box Multiple Objects for Monocular 3D Object Detection

OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection

Unleash the Potential of Image Branch for Cross-modal 3D Object Detection

PVConvNet: Pixel-Voxel Sparse Convolution for multimodal 3D object detection

Towards Open-set Camera 3D Object Detection