Abstract: In this report, we present RT-DETRv2, an improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, and opens up a set of bag-of-freebies for flexibility and practicality, as well as optimizing the training strategy to achieve enhanced performance. To improve flexibility, we suggest setting a distinct number of sampling points for features at different scales in the deformable attention to achieve selective multi-scale feature extraction by the decoder. To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator, which RT-DETR requires but YOLOs do not. This removes the deployment constraints typically associated with DETRs. For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameter customization to improve performance without loss of speed. Source code and pre-trained models will be available at https://github.com/lyuwenyu/RT-DETR.
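To make the difference between the two sampling operators concrete, here is a minimal pure-Python sketch. It assumes that "discrete sampling" means rounding each predicted location to the nearest integer pixel and clamping it to the feature map, so no interpolation is needed; the abstract does not spell out the exact rule. The function names `bilinear_sample` and `discrete_sample` are illustrative and are not taken from the RT-DETR code base.

```python
import math

def bilinear_sample(feature, x, y):
    """Interpolate an H x W nested list at continuous pixel coordinates (x, y).

    Mimics what a grid_sample-style operator does for one point,
    treating positions outside the map as zero.
    """
    h, w = len(feature), len(feature[0])
    x0, y0 = math.floor(x), math.floor(y)
    wx, wy = x - x0, y - y0

    def at(row, col):
        # Zero padding outside the feature map.
        return feature[row][col] if 0 <= row < h and 0 <= col < w else 0.0

    top = (1 - wx) * at(y0, x0) + wx * at(y0, x0 + 1)
    bottom = (1 - wx) * at(y0 + 1, x0) + wx * at(y0 + 1, x0 + 1)
    return (1 - wy) * top + wy * bottom

def discrete_sample(feature, x, y):
    """Assumed discrete variant: round to the nearest pixel, clamp, then index."""
    h, w = len(feature), len(feature[0])
    col = min(max(math.floor(x + 0.5), 0), w - 1)
    row = min(max(math.floor(y + 0.5), 0), h - 1)
    return feature[row][col]

feature = [[0.0, 1.0],
           [2.0, 3.0]]
print(bilinear_sample(feature, 0.5, 0.5))  # blends all four neighbours
print(discrete_sample(feature, 0.6, 0.2))  # reads a single pixel
```

The discrete version is a plain index lookup, which is why it can be exported to runtimes that lack an interpolating sampling operator. The trade-off is that the sampled value no longer varies smoothly with the predicted location.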

What problem does this paper attempt to address?

The paper improves the real-time object detector RT-DETR and proposes an enhanced version, RT-DETRv2. RT-DETRv2 targets three aspects of real-time object detection (detection performance, flexibility, and practicality) while maintaining high speed. The specific issues the paper attempts to resolve are:

1. **Flexibility in Multi-Scale Feature Extraction**: RT-DETRv2 sets a different number of sampling points for features at each scale within the deformable attention module. This lets the decoder extract multi-scale features selectively and improves the model's sensitivity to information at different scales.

2. **Practicality and Deployment Constraints**: RT-DETR relies on the `grid_sample` operator, which constrains deployment. The paper introduces an optional `discrete_sample` operator as a replacement, removing the deployment constraints usually associated with detection Transformers and making the model easier to deploy.

3. **Optimized Training Strategy**: RT-DETRv2 proposes dynamic data augmentation and scale-adaptive hyperparameter customization to improve performance without sacrificing speed. Dynamic data augmentation applies stronger augmentation early in training and weakens it later on to improve generalization. Scale-adaptive hyperparameter customization accounts for the characteristics of RT-DETR models of different sizes and adjusts parameters such as the learning rate accordingly.

4. **Performance Improvement**: With these changes, RT-DETRv2 outperforms the original RT-DETR across detectors of various scales without any loss of speed. Experimental results show improvements in Average Precision (AP) and AP50 on the COCO dataset.
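The dynamic data augmentation idea in item 3 can be sketched as an epoch-dependent choice of transforms. This is only an illustration under stated assumptions. The operation names are placeholder strings, the function `augmentations_for_epoch` and its `final_plain_epochs` parameter are hypothetical, and a single hard switch near the end of training stands in for the weakening, since the summary does not give the exact schedule.

```python
# Placeholder operation names (assumptions, not the paper's configuration).
STRONG_OPS = ["photometric_distort", "zoom_out", "iou_crop"]
BASE_OPS = ["horizontal_flip", "resize"]

def augmentations_for_epoch(epoch, total_epochs, final_plain_epochs=2):
    """Return the augmentation names to apply at a 0-indexed epoch.

    Strong operations are used for most of training and dropped for the
    last `final_plain_epochs` epochs (a hypothetical cutoff).
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    if epoch >= total_epochs - final_plain_epochs:
        return list(BASE_OPS)
    return STRONG_OPS + BASE_OPS

for epoch in (0, 7, 8, 9):
    print(epoch, augmentations_for_epoch(epoch, total_epochs=10))
```

Because the schedule only changes which transforms the data loader applies, it has no effect on the model's inference speed, which is consistent with the "without sacrificing speed" claim above.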
In summary, through a series of technical innovations, the paper aims to provide a more flexible, practical, and high-performance real-time object detection baseline model, further advancing the development of the real-time detection Transformer family.

RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer
