Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data

Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, Gao Huang

2024-11-24

Abstract: Open-vocabulary 3D object detection has recently attracted considerable attention due to its broad applications in autonomous driving and robotics, which aims to effectively recognize novel classes in previously unseen domains. However, existing point cloud-based open-vocabulary 3D detection models are limited by their high deployment costs. In this work, we propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data. Unlike traditional methods, OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes. Instead, it employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors. However, training 3D models with labels directly derived from pseudo-LiDAR is inadequate due to imprecise boxes estimated from noisy point clouds and severely occluded objects. To address these issues, we introduce two innovative designs: adaptive pseudo-LiDAR erosion and bounding box refinement with prior knowledge from large language models. These techniques effectively calibrate the 3D labels and enable RGB-only training for 3D detectors. Extensive experiments demonstrate the superiority of OVM3D-Det over baselines in both indoor and outdoor scenarios. The code will be released.
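Pseudo-LiDAR, as the term is generally used, means back-projecting a per-pixel depth estimate through a pinhole camera model to obtain a point cloud. The sketch below illustrates that general idea with NumPy. It is not the authors' code: the function name `depth_to_pseudo_lidar`, its parameters, and the assumption that a depth map and camera intrinsics (`fx`, `fy`, `cx`, `cy`) are already available are all illustrative.

```python
import numpy as np


def depth_to_pseudo_lidar(depth, fx, fy, cx, cy, mask=None):
    """Back-project a depth map into a camera-frame point cloud.

    depth: (H, W) array of depths in meters (hypothetical input, e.g. from
           a monocular depth estimator).
    fx, fy, cx, cy: pinhole intrinsics in pixels.
    mask:  optional (H, W) boolean array selecting one object's pixels.
    Returns an (N, 3) array of (x, y, z) points.
    """
    h, w = depth.shape
    # u indexes columns, v indexes rows; both have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0  # skip pixels without a depth estimate
    if mask is not None:
        valid &= mask.astype(bool)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Usage: a tiny 2x2 depth map with every pixel 2 m away.
points = depth_to_pseudo_lidar(np.full((2, 2), 2.0), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Passing a 2D instance mask as `mask` restricts the output to one object's points, which is the form a 3D box would then be fitted to.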

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to train an open-vocabulary monocular 3D object detection model without any 3D data. Existing point cloud-based open-vocabulary 3D detectors are limited by high deployment costs. The paper proposes OVM3D-Det, an open-vocabulary monocular 3D object detection framework trained only on RGB images, which keeps it cost-effective and lets it scale to publicly available data. Unlike traditional methods, OVM3D-Det requires no high-precision LiDAR or 3D sensor data, either as input or for generating 3D bounding boxes. Instead, it uses open-vocabulary 2D models and pseudo-LiDAR to label 3D objects in RGB images automatically, which supports the learning of open-vocabulary monocular 3D detectors. Training a 3D model directly on labels derived from pseudo-LiDAR is inadequate, however, because boxes estimated from noisy point clouds are imprecise and severely occluded objects are hard to handle. To address these issues, the paper introduces two designs: adaptive pseudo-LiDAR erosion and bounding box refinement with prior knowledge from large language models. These techniques calibrate the 3D labels and enable RGB-only training of 3D detectors. Experiments show that OVM3D-Det outperforms baselines in both indoor and outdoor scenarios. The authors state that the code will be released.
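The summary names the two label-calibration ideas without giving details, so the following is only one plausible reading, not the paper's implementation. It assumes that "adaptive erosion" shrinks an object's 2D mask by a radius proportional to the mask's extent (so small objects are not erased), and that "refinement with priors" begins by checking a fitted box against typical class dimensions. The helper names, the `ratio` and `tol` parameters, and the car dimensions are hypothetical placeholders; in the paper the priors come from a large language model.

```python
import numpy as np


def erode_mask(mask, radius):
    """Binary erosion with a (2*radius+1) x (2*radius+1) square element."""
    mask = mask.astype(bool)
    if radius <= 0:
        return mask.copy()
    h, w = mask.shape
    padded = np.pad(mask, radius, constant_values=False)
    out = np.ones_like(mask)
    # A pixel survives only if every neighbor in the window is set.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out &= padded[dy:dy + h, dx:dx + w]
    return out


def adaptive_erode(mask, ratio=0.1):
    """Erode with a radius scaled to the mask's smaller side (assumption)."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return mask.astype(bool)
    extent = min(ys.max() - ys.min() + 1, xs.max() - xs.min() + 1)
    return erode_mask(mask, int(ratio * extent))


def box_is_plausible(dims, prior, tol=0.5):
    """True if each (length, width, height) is within tol * prior of the prior."""
    return all(abs(d - p) <= tol * p for d, p in zip(dims, prior))


# Illustrative prior only; not a value taken from the paper.
CAR_PRIOR = (4.5, 1.8, 1.5)  # meters

# Usage: a 10x10 object mask loses a 1-pixel border; a 3x3 mask is untouched.
big = np.zeros((20, 20), dtype=bool)
big[5:15, 5:15] = True
small = np.zeros((20, 20), dtype=bool)
small[2:5, 2:5] = True
big_eroded = adaptive_erode(big)
small_eroded = adaptive_erode(small)
```

Under this reading, a box that fails `box_is_plausible` (for example, an occluded car fitted with only 1 m of visible length) would be flagged for adjustment rather than used as a training label directly.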
