HCLT-YOLO: A Hybrid CNN and Lightweight Transformer Architecture for Object Detection in Complex Traffic Scenes

Zhige Chen, Kai Yang, Yandong Wu, Hao Yang, Xiaolin Tang
DOI: https://doi.org/10.1109/tvt.2024.3496513
IF: 6.8
2024-01-01
IEEE Transactions on Vehicular Technology
Abstract: The swift and accurate detection of traffic signs in traffic scenes is a pivotal aspect of environmental perception technology in autonomous driving systems. Traffic signs provide essential road information and regulatory instructions, which are critical to ensuring road safety. This paper presents the HCLT-YOLO model to address the challenges of false alarms and missed detections in complex traffic environments. Specifically, we propose a novel hybrid CNN-transformer network architecture that efficiently integrates both local and global features, thereby improving traffic sign feature representation. To further enhance the model's sensitivity to small traffic signs, we optimize the structure by introducing a dedicated small-object detection layer through upsampling and by leveraging SIoU to improve detection accuracy and computational efficiency. However, the addition of the small-object detection layer and the transformer module increases the overall computational complexity and parameter count, potentially affecting real-time performance. To address this issue, we introduce the DG-C2f module, which employs linear transformations for feature mapping, streamlining the convolution process and enhancing real-time feasibility. Experimental evaluations on the GTSDB and TT100K datasets demonstrate that the proposed model improves detection accuracy by 2.5% and 6.8%, respectively, compared with the YOLOv8s model. Notably, the detection accuracy for small traffic signs improves significantly, by 6.9% and 11.7%, respectively. Additionally, processor-in-the-loop experiments on the NVIDIA Jetson AGX Orin show that the model achieves an inference speed of 46 FPS, meeting the real-time requirements for in-vehicle applications.
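The abstract refers to SIoU, a member of the IoU family of bounding-box regression losses. As a minimal sketch, the plain intersection-over-union measure that such losses build on can be written as below. This is not the paper's SIoU formulation, which adds further penalty terms. The function name `iou` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Plain intersection over union for two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (an assumed layout).
    This is the baseline overlap measure only, not SIoU.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# The same 4-pixel horizontal localization error on a small and a large box.
small = iou((0, 0, 16, 16), (4, 0, 20, 16))
large = iou((0, 0, 64, 64), (4, 0, 68, 64))
print(f"16x16 box: IoU = {small:.3f}")
print(f"64x64 box: IoU = {large:.3f}")
```

The two printed values illustrate why small traffic signs are harder to localize: an identical pixel offset costs a small box far more overlap than a large one, which is one motivation for a dedicated small-object detection layer.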