OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang
2024-07-22
Abstract: Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enhance the cross-modality alignment through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper primarily aims to address two key challenges in Open-Vocabulary Detection (OVD):

1. **Pseudo-Label Noise Issue**: Existing methods achieve zero-shot detection capability by pre-training on large-scale datasets and generating pseudo-labels on image-text data. However, this approach introduces data noise, leading to inaccurate predictions when the model encounters new, unseen categories.
2. **Cross-Modal Alignment Issue**: Open-vocabulary detection methods need to detect corresponding objects in images based on specific category descriptions. Given the diverse features of objects in images, aligning these objects with specific category descriptions is a challenge.

To address these two issues, the authors propose a new unified open-vocabulary detection method, OV-DINO. Specifically, they introduce the following two key techniques:

- **Unified Data Integration (UniDI) Pipeline**: This integrates data from different sources into a unified detection format and performs end-to-end pre-training on large-scale datasets, thereby eliminating the need for pseudo-label generation and enhancing vocabulary concepts.
- **Language-Aware Selective Fusion (LASF) Module**: This improves cross-modal alignment at the region level through a language-aware query selection and fusion process.

With these techniques, OV-DINO achieves significant performance improvements on COCO and LVIS benchmarks, particularly in zero-shot settings, with AP improvements of 2.5% and 12.7% respectively compared to the previous G-DINO method.
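
The UniDI idea of mapping different data sources into one detection-centric format can be illustrated with a minimal sketch. The summary above does not spell out the conversion rule, so this sketch assumes one plausible rule: an image-text pair becomes a single annotation whose box covers the whole image and whose label is the caption. The `DetectionSample` schema and both helper functions are hypothetical names for illustration, not the OV-DINO code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class DetectionSample:
    """One training sample in a detection-centric format (hypothetical schema)."""
    image_id: str
    width: int
    height: int
    boxes: List[Box]
    labels: List[str]  # free-form text: a class name or a whole caption


def from_detection(image_id: str, width: int, height: int,
                   boxes: Sequence[Box], class_names: Sequence[str]) -> DetectionSample:
    """Detection data already has boxes; class names are used as text labels."""
    if len(boxes) != len(class_names):
        raise ValueError("each box needs exactly one class name")
    return DetectionSample(image_id, width, height, list(boxes), list(class_names))


def from_image_text(image_id: str, width: int, height: int,
                    caption: str) -> DetectionSample:
    """Image-text data has no boxes.

    Assumption: use one box spanning the full image, labeled with the caption,
    so no pseudo-label generator is involved.
    """
    full_image_box = (0.0, 0.0, float(width), float(height))
    return DetectionSample(image_id, width, height, [full_image_box], [caption])


# Both sources end up in the same structure and can share one training loop.
samples = [
    from_detection("det_0001", 640, 480, [(10.0, 20.0, 200.0, 300.0)], ["dog"]),
    from_image_text("cap_0001", 800, 600, "a dog running on a beach"),
]
```

Under this assumption the noise from pseudo-labeling is avoided by construction: every box is either a human annotation or the trivially correct image extent.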
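
The "language-aware query selection and fusion" described for LASF can likewise be sketched in a few lines. This is a simplified illustration under stated assumptions, not the module from the paper: selection is assumed to rank image features by their best dot-product similarity to any text embedding, and fusion is assumed to be a single attention step with a residual connection and no learned weights. The function names `select_queries` and `fuse_with_text` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_queries(image_feats: List[Vector], text_feats: List[Vector], k: int) -> List[int]:
    """Return the indices of the k image features most similar to any text embedding."""
    scores = [max(dot(feat, text) for text in text_feats) for feat in image_feats]
    ranked = sorted(range(len(image_feats)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def fuse_with_text(query: Vector, text_feats: List[Vector]) -> Vector:
    """Add an attention-weighted mix of the text embeddings to one query."""
    scale = math.sqrt(len(query))
    logits = [dot(query, text) / scale for text in text_feats]
    peak = max(logits)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    context = [
        sum(w / total * text[d] for w, text in zip(weights, text_feats))
        for d in range(len(query))
    ]
    return [q + c for q, c in zip(query, context)]  # residual connection


# Toy example: four image features, one text embedding (e.g. for "dog").
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text_feats = [[1.0, 0.0]]

selected = select_queries(image_feats, text_feats, k=2)  # [0, 2]
fused = [fuse_with_text(image_feats[i], text_feats) for i in selected]
```

In the toy example the two features most aligned with the text embedding are kept as object queries, and each is then shifted toward the text representation before decoding.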