ST-YOLOX: a Lightweight and Accurate Object Detection Network Based on Swin Transformer

Jingjing Han, Guangqi Yang, Hongyang Wei, Weijun Gong, Yurong Qian
DOI: https://doi.org/10.1007/s11227-023-05744-9
2024-01-01
Abstract: With the rapid development of artificial intelligence and Internet of Things (IoT) technology, an increasing number of edge devices have entered people’s daily lives. However, because edge devices have limited performance, complex models can degrade the response speed and efficiency of the whole system. Existing research still cannot simultaneously satisfy the accuracy and response-speed demands of edge devices. This paper proposes a lightweight and highly accurate object detection model that uses the Transformer to address edge devices’ limited computational capacity and storage space. Specifically, the proposed model adopts the Swin Transformer for multi-scale feature extraction to achieve better global modeling capability. In addition, we propose a Neck module with a path aggregation network (PAN), designed with a two-feature pyramid structure that combines semantic and localization information to improve performance by exploiting the underlying location features. A lightweight detection head is then developed using group convolution; it fuses the two localization branches and removes the additional decoupling operation. Finally, we conduct comparative experiments on three datasets: the Retail-cabinet dataset, the Roadsign dataset, and the Pascal VOC dataset. Experimental results show that, compared with the baseline model, our model achieves an 11.8% improvement in mAP on the Retail-cabinet dataset while reducing Params and FLOPs by 23.19% and 71.50%, respectively. The proposed model effectively reduces computational complexity and improves detection performance, and therefore has high practical value. The code is available at https://github.com/ydlam/ST-YOLOX.
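As a rough illustration of why group convolution makes a detection head lighter, the sketch below counts the learnable parameters of a standard convolution and of a grouped one. The channel count (256), kernel size (3), and group number (4) are illustrative assumptions and are not taken from the paper; the helper `conv_params` is hypothetical and only encodes the usual parameter-count formula for a 2D convolution.

```python
def conv_params(c_in, c_out, kernel, groups=1, bias=True):
    """Learnable parameters of a 2D convolution with square kernels.

    Each output channel only sees c_in / groups input channels, so the
    weight count shrinks by a factor of `groups`.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


# Illustrative sizes only (assumed, not from the paper).
standard = conv_params(256, 256, 3)            # ordinary convolution
grouped = conv_params(256, 256, 3, groups=4)   # group convolution, 4 groups

print(f"standard: {standard} parameters")
print(f"grouped:  {grouped} parameters")
print(f"reduction: {1 - grouped / standard:.1%}")
```

With these assumed sizes the weight tensor shrinks by the group factor while the bias term is unchanged, which is the general mechanism behind lightweight heads built on group convolution; the actual savings reported in the abstract depend on the full architecture.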