Abstract: We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: https://github.com/Peterande/D-FINE.

What problem does this paper attempt to address?

This paper addresses several key challenges in real-time object detection, particularly in bounding box regression and in the efficiency of knowledge distillation. Specifically:

1. **Optimization Difficulties in Bounding Box Regression**:
   - **Limitations of Fixed-coordinate Regression**: Most detectors predict bounding boxes by regressing fixed coordinates. This makes localization uncertainty hard to model and leaves optimization sensitive to small coordinate changes, resulting in slow convergence and poor performance.
   - **Insufficient Distribution Modeling**: Although some methods (such as GFocal) handle uncertainty and ambiguity through probability distributions, they remain limited by anchor dependence, lack of iterative refinement, and coarse-grained localization.
2. **Efficiency Improvement of Real-Time Detectors**:
   - **Limited Computational Resources**: Real-time detectors must maintain speed within a limited compute and parameter budget, which places stricter demands on model design.
   - **Effectiveness of Knowledge Distillation**: Traditional knowledge distillation methods (such as Logit Mimicking and Feature Imitation) are not effective in detection tasks and may even degrade performance. Existing localization distillation methods also suffer from large training overhead and incompatibility with anchor-free detectors.
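The distribution-based alternative to fixed-coordinate regression mentioned above can be illustrated with a minimal sketch. It assumes the common formulation in which each box edge is predicted as a softmax distribution over discrete offset bins and decoded as the expected value; the function names and toy logits are hypothetical and not taken from any particular codebase.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_offset(logits):
    """Expected edge offset (in bin units) under the predicted distribution.

    Assumption: bin i stands for an offset of i units, so the result
    lies in the range [0, len(logits) - 1].
    """
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))

# A sharp distribution: the model is confident the edge is about 3 units away.
sharp = [0.0, 0.0, 0.0, 10.0, 0.0]
# A flat distribution: the edge is ambiguous, so the estimate is the bin midpoint.
flat = [1.0, 1.0, 1.0, 1.0, 1.0]

sharp_offset = expected_offset(sharp)
flat_offset = expected_offset(flat)
```

The shape of the distribution carries the uncertainty that a single regressed coordinate cannot express. With evenly spaced bins predicted in one shot, however, the resolution is fixed by the bin width, which is one way to read the "coarse-grained" limitation listed above.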
To solve these problems, the paper proposes D-FINE (Fine-grained Distribution Refinement in DETR), a powerful real-time object detector that redefines the bounding box regression task through two key components:

- **Fine-grained Distribution Refinement (FDR)**:
  - Transforms bounding box regression from predicting fixed coordinates to iteratively refining probability distributions, providing a more fine-grained intermediate representation and significantly improving localization accuracy.
  - Allows more precise, incremental adjustments through a non-uniform weighting function, reducing prediction errors.
- **Global Optimal Localization Self-Distillation (GO-LSD)**:
  - Transfers the localization knowledge of deep layers to shallow layers through self-distillation, enabling the shallow layers to make better early adjustments, which accelerates convergence and improves overall performance.
  - Optimizes unmatched predictions during training to improve overall stability.

In addition, D-FINE applies lightweight optimizations to the existing real-time DETR architecture, further improving speed and efficiency, while FDR and GO-LSD offset the performance loss these optimizations cause.

Experimental results show that D-FINE achieves state-of-the-art real-time object detection performance on the COCO dataset, surpassing existing models in both accuracy and efficiency and performing particularly well after large-scale pre-training.
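The two components described above can be sketched for a single box edge. This is an illustration under stated assumptions, not the authors' implementation: the polynomial weighting in `bin_weights`, the use of KL divergence as the distillation loss, and all names and toy values are hypothetical. The sketch assumes each decoder layer outputs residual logits that are added to the previous layer's logits, and that the decoded offset is always measured from the initial estimate.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bin_weights(num_bins, scale=0.5, power=2.0):
    """Hypothetical non-uniform weighting (requires num_bins > 1): symmetric
    around zero, small steps near the centre and large steps at the ends."""
    half = (num_bins - 1) / 2
    weights = []
    for n in range(num_bins):
        t = (n - half) / half  # bin position mapped to [-1, 1]
        weights.append(math.copysign(scale * abs(t) ** power, t))
    return weights

def refine_edge(initial_offset, layer_deltas, weights, box_size):
    """Accumulate residual logits layer by layer; decode one offset per layer.

    Assumed decoding: initial_offset + box_size * sum_n weights[n] * prob[n].
    """
    logits = [0.0] * len(weights)
    offsets, distributions = [], []
    for delta in layer_deltas:
        logits = [l + d for l, d in zip(logits, delta)]  # residual update
        probs = softmax(logits)
        correction = sum(w * p for w, p in zip(weights, probs))
        offsets.append(initial_offset + box_size * correction)
        distributions.append(probs)
    return offsets, distributions

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) between two discrete distributions."""
    return sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(teacher, student))

def self_distillation_loss(distributions):
    """Average KL from the final layer's distribution to each shallower layer."""
    teacher = distributions[-1]
    students = distributions[:-1]
    return sum(kl_divergence(teacher, s) for s in students) / len(students)

weights = bin_weights(num_bins=5)
# Toy residual logits for three decoder layers, each nudging toward bin 3.
layer_deltas = [
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
offsets, distributions = refine_edge(10.0, layer_deltas, weights, box_size=20.0)
loss = self_distillation_loss(distributions)
```

In this toy run each layer sharpens the distribution around the same bin, so the decoded offset moves a little further from the initial estimate at every step. The distillation term is positive because the shallower layers are still flatter than the final one, and minimizing it would pull them toward the refined distribution.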

D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement

Enhancing Your Trained DETRs with Box Refinement

Rank-DETR for High Quality Object Detection

Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection

Single-Shot Refinement Neural Network for Object Detection

Suppress-and-Refine Framework for End-to-End 3D Object Detection

DEIM: DETR with Improved Matching for Fast Convergence

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Deformable DETR: Deformable Transformers for End-to-End Object Detection

RefineFace: Refinement Neural Network for High Performance Face Detection

DETRs Beat YOLOs on Real-time Object Detection

Guided Refine-Head For Object Detection

DETR Doesn't Need Multi-Scale or Locality Design

Decoupled Classification Refinement: Hard False Positive Suppression for Object Detection

RefineDetLite: A Lightweight One-stage Object Detection Framework for CPU-only Devices

Learning 1-Bit Tiny Object Detector with Discriminative Feature Refinement

DFE-Net: detail feature extraction network for small object detection

An Improved DETR Based on Angle Denoising and Oriented Boxes Refinement for Remote Sensing Object Detection

DFD: Distilling the Feature Disparity Differently for Detectors

RTMDet: An Empirical Study of Designing Real-Time Object Detectors

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection