LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang
2024-06-06
Abstract: In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, and the intermediate and final feature maps in the ViT encoder, forming richer feature maps, and introduce window-major feature map organization for improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior over existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at https://github.com/Atten4Vis/LW-DETR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses **performance and efficiency in real-time object detection**. Specifically, the authors aim to build a lightweight Transformer-based object detection model (LW-DETR) that outperforms existing convolutional-network-based methods, such as the YOLO series, on real-time object detection tasks.

### Detailed Explanation:

1. **Research Background**:
   - Mainstream real-time object detection methods rely mainly on convolutional neural networks (CNNs), such as the YOLO series.
   - Although Transformer-based methods (such as DETR) have made significant progress in object detection, they have not been fully explored for real-time detection, and it is unclear whether they can match state-of-the-art convolutional methods.
2. **Research Objectives**:
   - Build a lightweight DETR model (LW-DETR) for real-time object detection.
   - Improve detection performance and inference efficiency through multi-level feature aggregation and interleaved window and global attention.
   - Explore effective training techniques, such as improved loss functions and pre-training strategies, to further improve the model.
3. **Specific Problems**:
   - How can a lightweight, efficient Transformer architecture be designed so that it is competitive in real-time object detection?
   - How can computational complexity and memory consumption be reduced, by optimizing training and inference, to increase inference speed?
   - How can large-scale pre-training be used to improve generalization and detection accuracy?
4. **Solutions**:
   - **Architecture Design**: Adopt a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Reduce computational complexity and enrich the feature representation through multi-level feature aggregation and interleaved window and global attention.
   - **Training Techniques**: Introduce an IoU-aware classification loss, deformable cross-attention, and pre-training strategies to improve training and detection performance.
   - **Inference Optimization**: Reduce memory permutation operations and inference latency through the window-major feature map organization.
5. **Experimental Results**:
   - Experiments show that LW-DETR outperforms existing real-time detectors (such as YOLO-NAS, YOLOv8, and RTMDet) on COCO and other benchmark datasets, in terms of both mAP (mean Average Precision) and inference time.

In summary, this paper addresses the performance and efficiency problems of real-time object detection by building a lightweight, efficient Transformer-based detector and verifying it on multiple benchmark datasets.
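
The window-major feature map organization named in the abstract can be illustrated with a small index-reordering sketch. The idea, as described there, is to store tokens window by window so that window attention can take contiguous slices instead of re-permuting the feature map each time the encoder switches between window and global attention. The function names `window_major_order` and `split_windows` are hypothetical and are not taken from the LW-DETR code; the sketch shows only the memory layout, not the attention computation or the authors' implementation.

```python
def window_major_order(height, width, window):
    """Return row-major token indices reordered so each window is contiguous.

    Hypothetical helper for illustration; assumes height and width are
    multiples of the window size.
    """
    if height % window or width % window:
        raise ValueError("height and width must be multiples of window")
    order = []
    for win_row in range(0, height, window):      # top edge of each window
        for win_col in range(0, width, window):   # left edge of each window
            for dy in range(window):              # rows inside the window
                for dx in range(window):          # columns inside the window
                    order.append((win_row + dy) * width + (win_col + dx))
    return order


def split_windows(tokens, window):
    """Slice a window-major token list into per-window chunks (no reordering)."""
    size = window * window
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# A 4x4 feature map with 2x2 windows; each token is labelled by its
# row-major index, so the output shows where every token ends up.
order = window_major_order(4, 4, 2)
windows = split_windows(order, 2)
print(order)    # [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
print(windows)  # [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

In this layout a window-attention layer can read each window as one contiguous chunk, while a global-attention layer can attend over the full sequence in the same order, since the reordering is applied once rather than at every switch between the two attention types.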