OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang

2024-07-22

Abstract: Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enhance the cross-modality alignment through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper primarily aims to address two key challenges in Open-Vocabulary Detection (OVD):

1. **Pseudo-Label Noise Issue**: Existing methods achieve zero-shot detection capability by pre-training on large-scale datasets and generating pseudo-labels on image-text data. However, this approach introduces data noise, leading to inaccurate predictions when the model encounters new, unseen categories.
2. **Cross-Modal Alignment Issue**: Open-vocabulary detection methods need to detect corresponding objects in images based on specific category descriptions. Given the diverse features of objects in images, aligning these objects with specific category descriptions is a challenge.

To address these two issues, the authors propose a new unified open-vocabulary detection method: OV-DINO. Specifically, they introduce the following two key techniques:

- **Unified Data Integration (UniDI) Pipeline**: This integrates data from different sources into a unified detection format and performs end-to-end pre-training on large-scale datasets, thereby eliminating the need for pseudo-label generation and enhancing vocabulary concepts.
- **Language-Aware Selective Fusion (LASF) Module**: This improves cross-modal alignment at the region level through a language-aware query selection and fusion process.

With these techniques, OV-DINO achieves significant performance improvements on COCO and LVIS benchmarks, particularly in zero-shot settings, with AP improvements of 2.5% and 12.7% respectively compared to the previous G-DINO method.
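
To make the UniDI idea more concrete, here is a minimal Python sketch of what "unifying different data sources into a detection-centric format" could look like. It is an illustration under stated assumptions, not the authors' implementation: the `DetectionSample` class and the `from_detection` and `from_image_text` helpers are hypothetical names, and the choice to represent an image-text pair as a single full-image box labeled with its caption is an assumption about how such data can be converted without generating pseudo-boxes.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]


@dataclass
class DetectionSample:
    """One image in a detection-centric format: parallel lists of boxes and text labels."""
    image_id: str
    width: int
    height: int
    boxes: List[Box]
    labels: List[str]


def from_detection(image_id: str, width: int, height: int,
                   annotations: List[Tuple[Box, str]]) -> DetectionSample:
    """Detection data already pairs each box with a category name; copy both over."""
    boxes = [box for box, _ in annotations]
    labels = [name for _, name in annotations]
    return DetectionSample(image_id, width, height, boxes, labels)


def from_image_text(image_id: str, width: int, height: int,
                    caption: str) -> DetectionSample:
    """Assumption: the caption acts as one text label whose box covers the whole
    image, so no second model is needed to produce pseudo-boxes."""
    full_image_box = (0.0, 0.0, float(width), float(height))
    return DetectionSample(image_id, width, height, [full_image_box], [caption])


# Both sources end up with the same fields, so one training loop can consume them.
samples = [
    from_detection("det_0001", 640, 480, [((10.0, 20.0, 200.0, 300.0), "dog")]),
    from_image_text("cap_0001", 800, 600, "a dog running on the beach"),
]
for sample in samples:
    print(sample.image_id, sample.labels, sample.boxes)
```

Because every sample carries the same `boxes` and `labels` fields regardless of origin, a single detection loss can be applied to all of them, which is the property the summary attributes to end-to-end pre-training.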

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

OV-DAR: Open-Vocabulary Object Detection and Attributes Recognition

A Simple Framework for Open-Vocabulary Segmentation and Detection

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Learning Object-Language Alignments for Open-Vocabulary Object Detection

LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

WEA-DINO: An Improved DINO With Word Embedding Alignment for Remote Scene Zero-Shot Object Detection

OpenSD: Unified Open-Vocabulary Segmentation and Detection

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection

Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection