Abstract: Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneering efforts have integrated vision-language model (VLM) knowledge into OV-3DDet learning, the potential of these foundation models has yet to be fully exploited. In this paper, we unlock the textual and visual wisdom to tackle the open-vocabulary 3D detection task by leveraging the language and vision foundation models. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. Specifically, we utilize an object detection vision foundation model to enable the zero-shot discovery of objects in images, which serve as the initial seeds and filtering guidance to identify novel 3D objects. Additionally, to align the 3D space with the powerful vision-language space, we introduce a hierarchical alignment approach, where the 3D feature space is aligned with the vision-language feature space using a pre-trained VLM at the instance, category, and scene levels. Through extensive experimentation, we demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection in real-world scenarios.

What problem does this paper attempt to address?

The paper addresses open-vocabulary 3D object detection (OV-3DDet), in particular the handling of object categories that were not seen during training. Traditional 3D object detection methods usually assume that the categories encountered at test time match those observed during training, which is unrealistic in dynamically changing real-world scenes. Open-vocabulary 3D object detection is therefore crucial, because it can localize and recognize both known and unknown objects in new scenes.

Specifically, the paper points out that although language and vision foundation models have succeeded on various open-vocabulary tasks, a main challenge for OV-3DDet is the limited availability of training data. Some preliminary efforts have integrated vision-language model (VLM) knowledge into OV-3DDet learning, but the potential of these foundation models has not been fully exploited.

To overcome these challenges, the paper proposes a new method, Image-Guided Novel Class Discovery and Hierarchical Feature Space Alignment (INHA), which leverages language and vision foundation models to solve the open-vocabulary 3D detection task. INHA consists of two key components:

1. **Image-Guided Novel Class Discovery (IGND)**: A pre-trained open-vocabulary 2D detector extracts object-level information (2D object bounding boxes), which is combined with the 3D data to guide the discovery of novel 3D objects. The process lifts the 2D object center points into 3D space to provide additional query seeds, and uses the 2D bounding boxes to select reliable novel 3D bounding boxes.
2. **Hierarchical Feature Space Alignment**: The 3D feature space is aligned with the vision-language feature space at the instance level, category level, and scene level. Features from the different modalities are aligned through contrastive learning to improve the generalization ability and accuracy of the model.

Through these components, the paper aims to significantly improve the accuracy and generalization ability of open-vocabulary 3D object detection, demonstrating the potential of foundation models for open-vocabulary 3D object detection in real-world scenarios.
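The two geometric steps of IGND (lifting 2D box centers into 3D, and using 2D boxes to vet 3D candidates) can be illustrated with a small sketch. The summary does not say how the paper performs either step, so everything here is an assumption: a pinhole camera with intrinsics `fx, fy, cx, cy`, a per-pixel depth map, and a simple "projected center falls inside some 2D box" criterion. The function names `lift_centers_to_3d` and `supported_by_2d` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Box2D = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), pixels
Point3D = Tuple[float, float, float]       # (x, y, z) in the camera frame


def lift_centers_to_3d(
    boxes_2d: Sequence[Box2D],
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Point3D]:
    """Back-project each 2D box center to a 3D seed point.

    Assumes a pinhole camera and a depth map indexed as depth[row][col];
    a depth of zero or less is treated as "no reading" and skipped.
    """
    height, width = len(depth), len(depth[0])
    seeds: List[Point3D] = []
    for x_min, y_min, x_max, y_max in boxes_2d:
        # Center of the 2D box in pixel coordinates.
        u = (x_min + x_max) / 2.0
        v = (y_min + y_max) / 2.0
        # Nearest pixel, clamped to the image bounds.
        col = min(max(int(round(u)), 0), width - 1)
        row = min(max(int(round(v)), 0), height - 1)
        z = depth[row][col]
        if z <= 0:
            continue
        # Inverse pinhole projection.
        seeds.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return seeds


def supported_by_2d(
    center: Point3D,
    boxes_2d: Sequence[Box2D],
    fx: float, fy: float, cx: float, cy: float,
) -> bool:
    """Return True if the projected 3D center lies inside any 2D box.

    This containment test is an illustrative stand-in for the paper's
    (unspecified here) rule for selecting reliable 3D boxes.
    """
    x, y, z = center
    if z <= 0:  # behind the camera: cannot be supported by an image box
        return False
    u = fx * x / z + cx
    v = fy * y / z + cy
    return any(x0 <= u <= x1 and y0 <= v <= y1 for x0, y0, x1, y1 in boxes_2d)
```

In this sketch the lifted points would be appended to the detector's query seeds, and `supported_by_2d` would act as a filter on candidate novel 3D boxes. An overlap-based test between the projected 3D box and the 2D box would be a natural alternative to center containment.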
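The contrastive alignment in the second component can be sketched in the same spirit. The summary only states that contrastive learning is used, so the InfoNCE-style formulation below, the temperature value, and the name `contrastive_alignment_loss` are assumptions rather than the paper's actual loss. Row `i` of the 3D features is treated as the positive match for row `i` of the vision-language features, and all other rows act as negatives.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_alignment_loss(
    feats_3d: Sequence[Sequence[float]],
    feats_vl: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, common in CLIP-style training
) -> float:
    """One-directional InfoNCE loss from 3D features to vision-language features."""
    a = [_normalize(v) for v in feats_3d]
    b = [_normalize(v) for v in feats_vl]
    total = 0.0
    for i, a_i in enumerate(a):
        # Temperature-scaled cosine similarity to every candidate.
        logits = [sum(p * q for p, q in zip(a_i, b_j)) / temperature for b_j in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching pair.
        total += log_denominator - logits[i]
    return total / len(a)
```

Under this reading, the same loss could be applied at each level by changing what a row represents: per-object features at the instance level, per-class prototypes against text embeddings at the category level, and pooled scene features at the scene level. How the paper actually constructs these pairs is not described in this summary.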

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
