Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao, Na Zhao, Jingjing Chen, Yu-Gang Jiang
2024-07-18
Abstract: Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneering efforts have integrated vision-language model (VLM) knowledge into OV-3DDet learning, the full potential of these foundation models has yet to be exploited. In this paper, we unlock the textual and visual wisdom to tackle the open-vocabulary 3D detection task by leveraging the language and vision foundation models. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. Specifically, we utilize an object detection vision foundation model to enable the zero-shot discovery of objects in images, which serves as the initial seeds and filtering guidance to identify novel 3D objects. Additionally, to align the 3D space with the powerful vision-language space, we introduce a hierarchical alignment approach, where the 3D feature space is aligned with the vision-language feature space using a pre-trained VLM at the instance, category, and scene levels. Through extensive experimentation, we demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection in real-world scenarios.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses open-vocabulary 3D object detection (OV-3DDet), and in particular the detection of object categories that were not seen during training. Traditional 3D object detectors usually assume that the categories encountered at test time match those observed during training, which is unrealistic in dynamically changing real-world scenes. Open-vocabulary 3D detection is therefore crucial, because it can localize and recognize both seen and unseen objects in new scenes.

Specifically, the paper points out that although language and vision foundation models have succeeded on various open-vocabulary tasks, OV-3DDet is held back by limited training data. Some earlier efforts have integrated vision-language model (VLM) knowledge into OV-3DDet learning, but they have not exploited the full potential of these foundation models.

To overcome these challenges, the paper proposes a new method, Image-Guided Novel Class Discovery and Hierarchical Feature Space Alignment (INHA), which leverages language and vision foundation models for open-vocabulary 3D detection. INHA consists of two key components:

1. **Image-Guided Novel Class Discovery (IGND)**: A pre-trained open-vocabulary 2D detector extracts object-level information (2D object bounding boxes), which is combined with the 3D data to guide the discovery of novel 3D objects. The process lifts the 2D object center points into 3D space to provide additional query seeds, and uses the 2D bounding boxes to select reliable novel 3D bounding boxes.
2. **Hierarchical Feature Space Alignment**: The 3D feature space is aligned with the vision-language feature space at the instance, category, and scene levels. Features from the different modalities are aligned through contrastive learning to improve the model's generalization and accuracy.

With these components, the paper aims to significantly improve the accuracy and generalization of open-vocabulary 3D object detection, demonstrating the potential of foundation models for this task in real-world scenarios.
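
To make the two components more concrete, here is a minimal Python sketch of the basic operations they appear to rely on. It is not the authors' implementation. The first function assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a known depth at the box center, and back-projects a 2D box center into 3D camera coordinates to serve as a query seed. The second assumes an InfoNCE-style objective as one common form of contrastive learning; the summary does not say which loss the paper uses. All function and parameter names are hypothetical.

```python
import math


def lift_center_to_3d(box_2d, depth, fx, fy, cx, cy):
    """Back-project the center of a 2D box (x1, y1, x2, y2) into camera space.

    Assumption: pinhole camera model, with `depth` (e.g. in metres) measured
    along the optical axis at the box center.
    """
    x1, y1, x2, y2 = box_2d
    u = (x1 + x2) / 2.0  # pixel column of the box center
    v = (y1 + y2) / 2.0  # pixel row of the box center
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]


def contrastive_alignment_loss(feats_3d, feats_vl, temperature=0.07):
    """InfoNCE-style loss (an assumed choice, not confirmed by the source).

    The i-th 3D feature is treated as the positive match of the i-th
    vision-language feature; all other pairs act as negatives.
    """
    feats_3d = [_normalize(f) for f in feats_3d]
    feats_vl = [_normalize(f) for f in feats_vl]
    total = 0.0
    for i, f in enumerate(feats_3d):
        # Cosine similarity to every vision-language feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(f, g)) / temperature for g in feats_vl]
        # Numerically stable log-sum-exp over the logits.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(feats_3d)


# Hypothetical usage: one 2D detection lifted to a 3D seed point.
seed = lift_center_to_3d((100, 50, 200, 150), depth=2.0,
                         fx=500.0, fy=500.0, cx=160.0, cy=120.0)

# Matched feature pairs should yield a lower loss than mismatched ones.
aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
```

In a hierarchical setup like the one described, the same kind of loss could in principle be applied separately to instance-level, category-level, and scene-level feature pairs. How the paper weights or combines the three levels is not stated here.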