D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement

Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu
2024-10-18
Abstract: We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: https://github.com/Peterande/D-FINE.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets several key challenges in real-time object detection, chiefly in bounding box regression and in the efficiency of knowledge distillation. Specifically:

1. **Optimization Difficulties in Bounding Box Regression**
   - **Limitations of Fixed-coordinate Regression**: Most detectors predict bounding boxes by regressing fixed coordinates. This makes localization uncertainty hard to model and leaves optimization sensitive to small coordinate changes, resulting in slow convergence and poor performance.
   - **Insufficient Distribution Modeling**: Although some methods (such as GFocal) handle uncertainty and ambiguity through probability distributions, they remain limited by anchor dependence, a lack of iterative refinement, and coarse-grained localization.
2. **Efficiency Improvement of Real-Time Detectors**
   - **Limited Computational Resources**: Real-time detectors must maintain their speed within a limited compute and parameter budget, which places stricter requirements on model design.
   - **Effectiveness of Knowledge Distillation**: Traditional knowledge distillation methods (such as Logit Mimicking and Feature Imitation) are not effective in detection tasks and may even degrade performance. Existing localization distillation methods, in turn, suffer from large training overhead and incompatibility with anchor-free detectors.
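To make the contrast between fixed-coordinate and distribution-based regression concrete, the following is a minimal sketch in the spirit of GFocal-style decoding, where an edge offset is read out as the expectation of a discrete distribution over candidate bins. The bin values, the logits, and the helper names (`expected_offset`, `entropy`) are illustrative assumptions, not taken from the paper or from any library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(probs, bin_values):
    """Edge offset as the expectation of a discrete distribution over bins."""
    return sum(p * v for p, v in zip(probs, bin_values))


def entropy(probs):
    """Shannon entropy in nats; higher means a less certain edge location."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical setup: five candidate offsets (in pixels) for one box edge.
bin_values = [0.0, 1.0, 2.0, 3.0, 4.0]

# Two illustrative predictions with the same mean but different certainty.
confident = softmax([-4.0, -4.0, 6.0, -4.0, -4.0])
ambiguous = softmax([-4.0, 2.0, -4.0, 2.0, -4.0])

for name, probs in (("confident", confident), ("ambiguous", ambiguous)):
    print(name, round(expected_offset(probs, bin_values), 3), round(entropy(probs), 3))
```

Both predictions decode to the same offset, so a fixed-coordinate regressor could not tell them apart. The distribution keeps the extra information: the second prediction is split between two plausible edge positions, which shows up as higher entropy.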
To solve these problems, the paper proposes D-FINE (Fine-grained Distribution Refinement in DETR), a powerful real-time object detector that redefines the bounding box regression task through two key components:

- **Fine-grained Distribution Refinement (FDR)**:
  - Transforms bounding box regression from predicting fixed coordinates to iteratively refining probability distributions, providing a more fine-grained intermediate representation and significantly improving localization accuracy.
  - Uses a non-uniform weighting function to allow more precise, incremental adjustments, which reduces prediction errors.
- **Global Optimal Localization Self-Distillation (GO-LSD)**:
  - Transfers the localization knowledge of deep layers to shallow layers through self-distillation, enabling the shallow layers to make better early adjustments, which accelerates convergence and improves overall performance.
  - Optimizes unmatched predictions during training to improve overall stability.

In addition, D-FINE applies lightweight optimizations to the existing real-time DETR architecture, further improving speed and efficiency, while FDR and GO-LSD effectively offset the performance loss these optimizations cause. Experimental results show that D-FINE achieves state-of-the-art real-time object detection performance on the COCO dataset, surpassing existing models in both accuracy and efficiency and performing particularly well after large-scale pre-training.
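The two components described above, FDR and GO-LSD, can be illustrated with a small, self-contained sketch for a single box edge. It is not the authors' implementation. The cubic weighting in `nonuniform_weights`, the accumulation of per-layer residual logits from a uniform start, the `scale` factor, and the use of a plain KL divergence with the final layer as teacher are all assumptions chosen to mirror the summary's wording ("iteratively refining probability distributions", "non-uniform weighting function", "self-distillation" from deep to shallow layers).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nonuniform_weights(num_bins, max_shift):
    """Hypothetical weighting: cubic spacing, dense near zero, sparse at the ends."""
    if num_bins < 3 or num_bins % 2 == 0:
        raise ValueError("num_bins must be odd and at least 3")
    half = num_bins // 2
    return [max_shift * ((n - half) / half) ** 3 for n in range(num_bins)]


def refine_edge(initial_distance, scale, layer_residuals, weights):
    """Accumulate residual logits layer by layer; decode one edge distance per layer."""
    logits = [0.0] * len(weights)  # assumed starting point: uniform distribution
    distances, distributions = [], []
    for residual in layer_residuals:
        logits = [a + b for a, b in zip(logits, residual)]
        probs = softmax(logits)
        # Expected shift under the current distribution, in units of `scale`.
        shift = sum(p * w for p, w in zip(probs, weights))
        distances.append(initial_distance + scale * shift)
        distributions.append(probs)
    return distances, distributions


def kl_divergence(teacher, student):
    """KL(teacher || student) for two discrete distributions of equal length."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)


# Illustrative values only: 9 bins, three decoder layers that each push
# probability mass toward the same small positive shift (bin index 6).
weights = nonuniform_weights(num_bins=9, max_shift=0.5)
bump = [0.0] * 9
bump[6] = 1.5
distances, distributions = refine_edge(10.0, 20.0, [bump, bump, bump], weights)

# Self-distillation signal: the final layer's distribution teaches earlier layers.
teacher = distributions[-1]
distill_losses = [kl_divergence(teacher, student) for student in distributions[:-1]]

print([round(d, 3) for d in distances])
print([round(loss, 3) for loss in distill_losses])
```

In this toy setting each layer only has to predict a small residual, the decoded distance moves incrementally toward the target, and the distillation loss is largest for the shallowest layer, which is the layer that benefits most from the teacher's refined distribution. Matching, loss weighting, and the handling of unmatched predictions mentioned above are not reproduced here.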