DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, Lei Zhang
2024-11-22
Abstract: In this paper, we introduce DINO-X, which is a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompt, visual prompt, and customized prompt. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we have constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, for advancing the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset leads to a foundational object-level representation, which enables DINO-X to integrate multiple perception heads to simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, object-based QA, etc. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, both improving the previous SOTA performance by 5.8 AP. Such a result underscores its significantly improved capacity for recognizing long-tailed objects.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses object detection and understanding in the open world. It introduces DINO-X, a unified object-centric vision model developed by IDEA Research that aims for the best open-world object detection performance to date. DINO-X targets the following problems:
1. **Long-tailed object detection**: Traditional models perform poorly on long-tailed objects, that is, object categories that occur infrequently. DINO-X improves detection of such objects by extending its input options to text prompts, visual prompts, and customized prompts.
2. **Multi-modal prompt support**: Because DINO-X accepts visual and customized prompts in addition to text prompts, it can adapt to different detection requirements, especially when data is scarce or descriptions are limited.
3. **Prompt-free detection**: By developing a universal object prompt, DINO-X can detect anything in an image without requiring the user to provide any prompt, which enables prompt-free open-world detection.
4. **Multi-task perception and understanding**: DINO-X integrates multiple perception heads and simultaneously supports multiple object perception and understanding tasks, such as detection, segmentation, pose estimation, object captioning, and object-based QA, providing outputs at different levels from coarse to fine.
5. **Edge-device optimization**: In addition to the DINO-X Pro model with enhanced perception capabilities, a DINO-X Edge model optimized for edge devices is introduced for efficient inference in resource-constrained environments.
In summary, DINO-X aims to improve the efficiency and accuracy of open-world object detection and understanding through multi-modal prompt support and multi-task perception, particularly for long-tailed objects and resource-constrained environments.
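To make the prompt options described above concrete, here is a minimal Python sketch of how a caller might choose among text, visual, customized, and prompt-free detection. It is an illustration only and does not reflect any real DINO-X interface: `DetectionRequest`, `PromptMode`, the box format for visual prompts, the `customized_prompt_id` field, and the rule that at most one prompt type is given per request are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple

# Assumed visual-prompt format: (x1, y1, x2, y2) boxes in pixel coordinates.
Box = Tuple[float, float, float, float]


class PromptMode(Enum):
    TEXT = "text"
    VISUAL = "visual"
    CUSTOMIZED = "customized"
    PROMPT_FREE = "prompt_free"  # falls back to a universal object prompt


@dataclass(frozen=True)
class DetectionRequest:
    """Hypothetical request object; not part of any real DINO-X API."""

    text: Optional[str] = None
    visual_boxes: Optional[Sequence[Box]] = None
    customized_prompt_id: Optional[str] = None

    def mode(self) -> PromptMode:
        """Return the prompt mode implied by the fields that were set."""
        candidates = [
            (PromptMode.TEXT, self.text),
            (PromptMode.VISUAL, self.visual_boxes),
            (PromptMode.CUSTOMIZED, self.customized_prompt_id),
        ]
        chosen = [mode for mode, value in candidates if value]
        if len(chosen) > 1:
            # Simplifying assumption of this sketch: one prompt type per request.
            raise ValueError("provide at most one prompt type")
        # With no prompt at all, detection runs prompt-free.
        return chosen[0] if chosen else PromptMode.PROMPT_FREE
```

Under these assumptions, `DetectionRequest(text="fire hydrant").mode()` selects the text mode, while an empty `DetectionRequest()` selects the prompt-free mode, mirroring the case where the user provides no prompt.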