Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang

2024-07-19

Abstract: In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a $52.5$ AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean $26.1$ AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The main goal of this paper is to develop an open-set object detector named Grounding DINO, which can detect arbitrary objects specified through human language input such as category names or referring expressions. Specifically, the study combines the Transformer-based object detector DINO with grounded pre-training so that the detector generalizes to unseen object categories. To achieve this, the authors propose several key techniques:

1. **Tight Modality Fusion**: Building on DINO, the authors fuse image and text information at three phases of the detector, using a feature enhancer, language-guided query selection, and a cross-modality decoder.
2. **Large-Scale Grounded Pre-Training**: By pre-training on large-scale datasets, the model achieves zero-shot transfer to unseen categories. These datasets include object detection data, grounding (referring) data, and image caption data.
3. **Sub-Sentence-Level Text Features**: To reduce unnecessary interaction between unrelated category names that share one text prompt, a sub-sentence-level text representation is introduced, which helps improve model performance.

Experimental results show that Grounding DINO performs well in zero-shot transfer on COCO, LVIS, and ODinW, reaching 52.5 AP on the COCO zero-shot transfer benchmark and a mean 26.1 AP on the ODinW zero-shot benchmark. It also performs well on the Referring Expression Comprehension (REC) task on RefCOCO/+/g. Compared to other open-set object detection methods, Grounding DINO achieves leading performance in various settings.
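The text-feature idea above can be illustrated with a small sketch. It assumes category names are joined into one prompt with a `.` separator and that each token may attend only to tokens of its own category, which gives a block-diagonal attention mask. The function name `sub_sentence_mask`, the whitespace tokenization, and the choice to group each separator with the category before it are illustrative assumptions, not the authors' implementation.

```python
def sub_sentence_mask(categories, sep="."):
    """Return (tokens, mask) for a prompt built from category names.

    mask[i][j] is True when token i may attend to token j, i.e. when both
    tokens belong to the same category name. Tokens are whitespace-split
    words (an assumption for readability; a real model would use a
    subword tokenizer).
    """
    tokens = []
    group = []  # group[i] = index of the category that token i belongs to
    for cat_id, name in enumerate(categories):
        for word in name.split():
            tokens.append(word)
            group.append(cat_id)
        # Assumption: the separator is grouped with the preceding category.
        tokens.append(sep)
        group.append(cat_id)
    n = len(tokens)
    mask = [[group[i] == group[j] for j in range(n)] for i in range(n)]
    return tokens, mask


if __name__ == "__main__":
    tokens, mask = sub_sentence_mask(["cat", "baseball glove", "dog"])
    print(" ".join(tokens))
    for row in mask:
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch, "baseball" and "glove" can attend to each other, while neither can attend to "cat" or "dog", so the words of one category name do not influence the features of another.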

Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

WEA-DINO: An Improved DINO With Word Embedding Alignment for Remote Scene Zero-Shot Object Detection

Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation

I-DINO: High-Quality Object Detection for Indoor Scenes

Task-decoupled interactive embedding network for object detection

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

A Strong and Reproducible Object Detector with Only Public Datasets

Cascading Alignment for Unsupervised Domain-Adaptive DETR with Improved DeNoising Anchor Boxes

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

Anchor-Free Oriented Proposal Generator for Object Detection

Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

More Pictures Say More: Visual Intersection Network for Open Set Object Detection

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation