RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer

Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, Yi Liu
2024-07-24
Abstract: In this report, we present RT-DETRv2, an improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, and opens up a set of bag-of-freebies for flexibility and practicality, as well as optimizing the training strategy to achieve enhanced performance. To improve the flexibility, we suggest setting a distinct number of sampling points for features at different scales in the deformable attention to achieve selective multi-scale feature extraction by the decoder. To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator that is specific to RT-DETR compared to YOLOs. This removes the deployment constraints typically associated with DETRs. For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameters customization to improve performance without loss of speed. Source code and pre-trained models will be available at https://github.com/lyuwenyu/RT-DETR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper improves the real-time object detector RT-DETR and proposes an enhanced version, RT-DETRv2. Its goal is to raise detection performance, flexibility, and practicality while keeping inference speed unchanged. The specific issues it addresses are:

1. **Flexibility in Multi-Scale Feature Extraction**: RT-DETRv2 sets a different number of sampling points for features at each scale within the deformable attention module. This lets the decoder extract multi-scale features selectively instead of treating every scale identically.
2. **Practicality and Deployment Constraints**: RT-DETR depends on the `grid_sample` operator, which YOLO-style detectors do not need and which constrains deployment. The paper introduces an optional `discrete_sample` operator that can replace it, removing the deployment constraints typically associated with detection Transformers.
3. **Optimized Training Strategy**: RT-DETRv2 proposes dynamic data augmentation and scale-adaptive hyperparameter customization to improve performance without sacrificing speed. Dynamic data augmentation applies stronger augmentation at the beginning of training and weakens it later on to improve generalization. Scale-adaptive hyperparameter customization adjusts parameters such as the learning rate to suit RT-DETR models of different sizes.
4. **Performance Improvement**: With these changes, RT-DETRv2 outperforms the original RT-DETR across detectors of various scales without any loss of speed. The reported experiments show improvements in Average Precision (AP) and AP50 on the COCO dataset.
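To illustrate the difference between interpolated and discrete sampling described above, here is a minimal pure-Python sketch. It is not the RT-DETRv2 implementation: the function names `bilinear_sample` and `discrete_sample`, the pixel-coordinate convention, and the border clamping are assumptions made for illustration. In particular, the sketch assumes that discrete sampling means rounding each sampling location to the nearest feature-map cell, which replaces the four-neighbour interpolation with a plain index lookup.

```python
import math


def bilinear_sample(feature, x, y):
    """Interpolate a 2D feature map (list of rows) at fractional pixel coords (x, y).

    Simplified stand-in for a grid_sample-style operator; clamps to the border.
    """
    h, w = len(feature), len(feature[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend the four neighbouring cells by their distance to (x, y).
    top = feature[y0][x0] * (1 - fx) + feature[y0][x1] * fx
    bottom = feature[y1][x0] * (1 - fx) + feature[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def discrete_sample(feature, x, y):
    """Assumed discrete variant: round to the nearest cell and index directly."""
    h, w = len(feature), len(feature[0])
    xi = min(max(int(math.floor(x + 0.5)), 0), w - 1)
    yi = min(max(int(math.floor(y + 0.5)), 0), h - 1)
    return feature[yi][xi]


feature = [[0.0, 10.0],
           [20.0, 30.0]]
interpolated = bilinear_sample(feature, 0.5, 0.5)  # blends all four cells
nearest = discrete_sample(feature, 0.4, 0.4)       # picks the top-left cell
```

The discrete version needs only rounding and indexing, which is the kind of operation most inference runtimes support, whereas the interpolated version depends on a dedicated sampling operator. The trade-off in this sketch is that the discrete lookup loses sub-pixel precision.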
In summary, through a series of technical innovations, the paper aims to provide a more flexible, practical, and high-performance real-time object detection baseline model, further advancing the development of the real-time detection Transformer family.
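The dynamic data augmentation strategy described above can be sketched as an epoch-dependent schedule. The following is a hedged illustration only: the function `active_augmentations`, the `final_epochs` cutoff, and the augmentation names are hypothetical placeholders, and the single hard switch is a simplification of "stronger early, weaker later" rather than the schedule used by the authors.

```python
def active_augmentations(epoch, total_epochs, strong_ops, base_ops, final_epochs=2):
    """Return the augmentation names to apply at a 0-indexed epoch.

    Strong augmentations are used early and dropped for the last
    `final_epochs` epochs (an assumed cutoff, not a value from the paper).
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    if epoch >= total_epochs - final_epochs:
        # Late phase: keep only the mild, always-on transforms.
        return list(base_ops)
    # Early phase: strong transforms first, then the mild ones.
    return list(strong_ops) + list(base_ops)


# Hypothetical augmentation names, for illustration only.
strong = ["photometric_distort", "zoom_out", "random_crop"]
base = ["horizontal_flip", "resize"]

early = active_augmentations(0, 10, strong, base)   # all five transforms
late = active_augmentations(9, 10, strong, base)    # only the two base transforms
```

Because the schedule only changes which transforms run during training, it has no effect on inference cost, which is consistent with the claim that the training changes improve accuracy without loss of speed.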