Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang
2024-07-19
Abstract: In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a $52.5$ AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean $26.1$ AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to build an open-set object detector, Grounding DINO, that can detect arbitrary objects specified through human language input such as category names or referring expressions. It combines the Transformer-based detector DINO with grounded pre-training so that the model generalizes to object categories not seen during training. The authors highlight three technical points:

1. **Tight Modality Fusion**: Building on DINO, the model fuses image and text information at three stages: a feature enhancer, language-guided query selection, and a cross-modality decoder.
2. **Large-Scale Grounded Pre-Training**: Pre-training on large-scale data, including object detection data, referring (grounding) data, and image caption data, enables zero-shot transfer to unseen categories.
3. **Sub-Sentence-Level Text Features**: To reduce unwanted interaction between different category names in the text prompt, a sub-sentence-level text representation is introduced, which improves model performance.

Experimental results show that Grounding DINO performs strongly in zero-shot transfer on COCO, LVIS, and ODinW, reaching 52.5 AP on COCO without any COCO training data and a mean 26.1 AP on ODinW, and that it also performs well on Referring Expression Comprehension (REC). The authors report leading performance compared with other open-set object detection methods across these settings.
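The language-guided query selection step named in point 1 can be illustrated with a toy sketch. It assumes that each image token is scored by its highest dot-product similarity to any text token, and that the top-scoring image tokens become the decoder queries. The function names (`dot`, `select_queries`) and the tiny two-dimensional features are hypothetical, chosen only for illustration. Plain Python lists stand in for tensors, and batching and padding masks are omitted.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_queries(
    image_features: List[Vector],
    text_features: List[Vector],
    num_query: int,
) -> List[int]:
    """Return indices of the num_query image tokens most similar to the text.

    Hypothetical helper: a simplified, unbatched version of the idea.
    """
    # Score each image token by its best match over all text tokens.
    scores = [
        max(dot(img, txt) for txt in text_features)
        for img in image_features
    ]
    # Rank image-token indices by descending score (stable for ties).
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:num_query]


# Toy inputs (assumed values): 4 image tokens, 2 text tokens, feature dim 2.
image_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_features = [[1.0, 0.0], [0.8, 0.2]]

selected = select_queries(image_features, text_features, num_query=2)
print(selected)  # [0, 2]
```

Here image tokens 0 and 2 are selected because their best similarity to a text token (1.0 and 0.5) exceeds that of tokens 1 and 3 (0.2 and -0.8). In a real model the selected indices would be used to initialize the decoder queries. That step is left out of this sketch.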