Abstract: In this paper, we introduce DINO-X, which is a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompt, visual prompt, and customized prompt. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we have constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, for advancing the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset leads to a foundational object-level representation, which enables DINO-X to integrate multiple perception heads to simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, object-based QA, etc. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, both improving the previous SOTA performance by 5.8 AP. Such a result underscores its significantly improved capacity for recognizing long-tailed objects.
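The rare-class claim in the abstract can be unpacked with simple arithmetic: if DINO-X Pro improves the previous SOTA by 5.8 AP on both benchmarks, the implied previous scores follow by subtraction. The sketch below only restates the numbers given in the abstract; the function name `implied_previous_sota` and the dictionary layout are illustrative assumptions, not anything from the paper.

```python
# Rare-class AP reported in the abstract for DINO-X Pro.
REPORTED_RARE_AP = {
    "LVIS-minival": 63.3,
    "LVIS-val": 56.5,
}

# Improvement over the previous SOTA stated in the abstract (same on both benchmarks).
REPORTED_GAIN = 5.8


def implied_previous_sota(new_ap: float, gain: float) -> float:
    """Return the previous best AP implied by a reported score and gain.

    Hypothetical helper for illustration; rounds to one decimal place,
    matching the precision used in the abstract.
    """
    return round(new_ap - gain, 1)


if __name__ == "__main__":
    for benchmark, ap in REPORTED_RARE_AP.items():
        previous = implied_previous_sota(ap, REPORTED_GAIN)
        print(f"{benchmark}: DINO-X Pro {ap} AP, implied previous SOTA {previous} AP")
```

These implied values are derived from the abstract alone and are not quoted from any results table.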

What problem does this paper attempt to address?

This paper addresses object detection and understanding in the open world. It introduces DINO-X, a unified object-centric vision model developed by IDEA Research that aims for the best open-world object detection performance to date. DINO-X targets the following problems:

1. **Long-tailed object detection**: Traditional models perform poorly on long-tailed objects, that is, objects that occur infrequently. DINO-X improves detection of such objects by extending its input options to text prompts, visual prompts, and customized prompts.
2. **Multi-modal prompt support**: Because DINO-X accepts visual and customized prompts in addition to text prompts, it can adapt more flexibly to different detection requirements, especially when data is scarce or an object is hard to describe in words.
3. **Prompt-free detection**: By developing a universal object prompt, DINO-X can detect any object in an image without requiring the user to provide a prompt, which enables prompt-free open-world detection.
4. **Multi-task perception and understanding**: DINO-X integrates multiple perception heads and can simultaneously support multiple object perception and understanding tasks, such as detection, segmentation, pose estimation, and object captioning, providing outputs at different levels from coarse to fine.
5. **Edge-device optimization**: In addition to the DINO-X Pro model with enhanced perception capabilities, a DINO-X Edge model optimized for edge devices is introduced for efficient inference in resource-constrained environments.

In summary, DINO-X aims to improve the efficiency and accuracy of open-world object detection and understanding through its multi-modal prompt support and multi-task perception capabilities, especially in applications that involve long-tailed objects and resource-constrained environments.
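The prompt options summarized above can be pictured as a small dispatch step: use whichever prompt the user supplies, and fall back to the universal object prompt when none is given. The following is a minimal sketch of that idea only. The `Prompt` class, the `resolve_prompt` function, the `kind` labels, and the rule that at most one prompt may be supplied are all assumptions made for illustration; they do not describe the DINO-X implementation or any published API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Prompt:
    """Hypothetical container for one prompt passed to a detector."""
    kind: str            # "text", "visual", "customized", or "universal"
    payload: Any = None  # e.g. a category string, example boxes, or an embedding


def resolve_prompt(
    text: Optional[str] = None,
    visual: Any = None,
    customized: Any = None,
) -> Prompt:
    """Pick the prompt to use, defaulting to the universal object prompt.

    Assumption for this sketch: the caller supplies at most one prompt type.
    """
    candidates = (("text", text), ("visual", visual), ("customized", customized))
    provided = [(kind, value) for kind, value in candidates if value is not None]

    if len(provided) > 1:
        raise ValueError("supply at most one of: text, visual, customized")
    if not provided:
        # Prompt-free mode: no user input, so fall back to the universal prompt.
        return Prompt(kind="universal")

    kind, payload = provided[0]
    return Prompt(kind=kind, payload=payload)
```

For example, `resolve_prompt(text="cat")` yields a text prompt, while `resolve_prompt()` yields the universal prompt that stands in for prompt-free detection.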

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
