Abstract: In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques: training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, i.e., the intermediate and final feature maps in the ViT encoder, forming richer feature maps, and introduce window-major feature map organization for improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior to existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at https://github.com/Atten4Vis/LW-DETR.
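To make the window-major feature map organization mentioned in the abstract more concrete, here is a minimal sketch in plain Python. It assumes a feature map stored as a flat row-major list of tokens and a square window whose size divides the height and width; the function names `window_major_order` and `split_windows` are hypothetical and are not taken from the LW-DETR code base. The idea illustrated is that once tokens are reordered so each window is contiguous, window attention can slice its inputs directly instead of permuting memory at every layer.

```python
def window_major_order(height, width, window):
    """Return row-major token indices rearranged so each window is contiguous.

    Assumption: `window` divides both `height` and `width`.
    """
    if height % window or width % window:
        raise ValueError("height and width must be multiples of window")
    order = []
    # Visit windows left to right, top to bottom.
    for win_row in range(0, height, window):
        for win_col in range(0, width, window):
            # Inside a window, visit tokens in row-major order.
            for r in range(win_row, win_row + window):
                for c in range(win_col, win_col + window):
                    order.append(r * width + c)
    return order


def split_windows(tokens, window):
    """Slice a window-major token list into per-window chunks (no reordering)."""
    size = window * window
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# A 4x4 feature map with 2x2 windows; tokens are labeled by row-major index.
order = window_major_order(4, 4, 2)
row_major_tokens = list(range(16))
window_major_tokens = [row_major_tokens[i] for i in order]
windows = split_windows(window_major_tokens, 2)
```

In this layout, a window-attention layer would operate on each chunk in `windows`, while a global-attention layer would operate on the full `window_major_tokens` list. Because self-attention without a mask does not depend on token order once positional information has been added, the reordering only needs to happen once rather than around every window-attention layer.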

What problem does this paper attempt to address?

The problem this paper addresses is **performance and efficiency in real-time object detection**. Specifically, the authors aim to build a lightweight Transformer-based object detection model (LW-DETR) that outperforms existing convolutional methods, such as the YOLO series, on real-time object detection tasks.

### Detailed Explanation

1. **Research Background**
   - Current mainstream real-time object detectors rely mainly on convolutional neural networks (CNNs), such as the YOLO series.
   - Transformer-based methods such as DETR have made significant progress in object detection, but they have not been fully explored for real-time detection, and it is unclear whether they can match state-of-the-art convolutional methods.
2. **Research Objectives**
   - Build a lightweight DETR model (LW-DETR) for real-time object detection.
   - Improve detection performance and inference efficiency through multi-level feature aggregation and interleaved window and global attention.
   - Explore effective training techniques, such as improved loss functions and pre-training strategies, to further improve the model.
3. **Specific Problems**
   - How can a lightweight, efficient Transformer architecture be designed so that it is competitive in real-time object detection?
   - How can computational complexity and memory consumption be reduced in training and inference, so that the model runs faster?
   - How can large-scale pre-training be used to improve the model's generalization ability and detection accuracy?
4. **Solutions**
   - **Architecture Design**: Adopt a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Reduce computational complexity with interleaved window and global attention, and enrich the feature representation with multi-level feature aggregation.
   - **Training Techniques**: Introduce an IoU-aware classification loss, deformable cross-attention, and pre-training strategies to improve training effectiveness and detection performance.
   - **Inference Optimization**: Reduce memory permutation operations and inference latency through the window-major feature map organization.
5. **Experimental Results**
   - Experiments show that LW-DETR significantly outperforms existing real-time detectors (such as YOLO-NAS, YOLOv8, and RTMDet) on COCO and other benchmark datasets, in terms of both mAP (mean Average Precision) and inference time.

In summary, this paper addresses the performance and efficiency problems of real-time object detection by building a lightweight and efficient Transformer-based detector, and verifies its superiority on multiple benchmark datasets.
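The IoU-aware classification loss listed under the training techniques can be illustrated with a small sketch. The summary above does not give the formula, so the form below is an assumption: the soft target for a matched (positive) prediction blends its predicted score `s` and its IoU `u` with the ground-truth box as `t = s**alpha * u**(1 - alpha)`, and unmatched (negative) predictions are penalized by a binary cross-entropy toward zero weighted by `s**2`. The function names and the default `alpha=0.25` are illustrative choices, not values confirmed by this page.

```python
import math


def bce(score, target, eps=1e-7):
    """Binary cross-entropy between a probability `score` and a soft `target`."""
    score = min(max(score, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(score) + (1.0 - target) * math.log(1.0 - score))


def iou_aware_cls_loss(pos_scores, pos_ious, neg_scores, alpha=0.25):
    """Sketch of an IoU-aware classification loss (assumed form, see lead-in).

    pos_scores: predicted class probabilities of matched predictions
    pos_ious:   IoU of each matched prediction with its ground-truth box
    neg_scores: predicted class probabilities of unmatched predictions
    """
    loss = 0.0
    for s, u in zip(pos_scores, pos_ious):
        # Soft target couples classification confidence with localization quality.
        t = (s ** alpha) * (u ** (1.0 - alpha))
        loss += bce(s, t)
    for s in neg_scores:
        # Confident false positives are penalized more strongly.
        loss += (s ** 2) * bce(s, 0.0)
    return loss


# A confident, well-localized match versus an under-confident one.
well_matched = iou_aware_cls_loss([0.9], [0.9], [])
under_confident = iou_aware_cls_loss([0.1], [0.9], [])
```

With this form, the target for a positive sample rises with its IoU, so the classification score is encouraged to reflect localization quality rather than a fixed label of 1.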

LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
