YOLO-Former: Marrying YOLO and Transformer for Foreign Object Detection

Yuan Dai, Weiming Liu, Heng Wang, Wei Xie, Kejun Long
DOI: https://doi.org/10.1109/TIM.2022.3219468
IF: 5.6
2022-01-01
IEEE Transactions on Instrumentation and Measurement
Abstract: The automatic detection of foreign objects between platform screen doors (PSDs) and metro train doors significantly affects personnel and property safety and helps maintain normal train operation. However, some existing methods determine only the presence of foreign objects and cannot indicate their categories. In addition, although deep-learning-based object detection algorithms can indicate both the presence and the categories of foreign objects, most of them harness only the information in region proposals and ignore global contextual information. Furthermore, their performance comes at a considerable computational cost, so they cannot be readily deployed in the metro environment. To address these issues and better implement foreign object detection (FOD), we present You Only Look Once-Transformer (YOLO-Former), a simple but efficient model. YOLO-Former is built on YOLOv5 in three steps. First, the vision transformer (ViT) is introduced for dynamic attention and global modeling, addressing the problem that the original YOLOv5 utilizes only the information in region proposals and has insufficient ability to capture global information. Second, the convolutional block attention module (CBAM) and the stem module are used to further improve feature representation and reduce floating-point operations (FLOPs). Finally, we design several variants with different widths and depths to suit different requirements. Experiments on the FOD dataset (FODD) and the PASCAL VOC dataset demonstrate that YOLO-Former-x consistently outperforms other state-of-the-art methods by significant margins (0.5-11.3 mean average precision (mAP) on FODD and 0.6-13.6 on the PASCAL VOC dataset). Last but not least, YOLO-Former-x maintains real-time processing speed (27.32 and 28.17 frames/s (FPS) on a TITAN Xp GPU).
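The "global modeling" attributed to the ViT component can be illustrated with a minimal sketch of single-head scaled dot-product self-attention over a flattened feature map, in which every spatial position attends to every other position. This is not the authors' implementation: the function name `global_self_attention`, the identity query/key/value projections, and the toy 2x2 feature map are all assumptions made for illustration, and a real ViT block would use learned projections, multiple heads, and normalization layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_self_attention(tokens):
    """Single-head scaled dot-product self-attention (illustrative only).

    tokens: list of N feature vectors, one per spatial position of a
    flattened feature map.
    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model learns these projections.
    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        # Score the query against every position, not just a local window.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        # Each output is a weighted average of all positions' features.
        out = [sum(w[j] * tokens[j][d] for j in range(len(tokens)))
               for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical 2x2 feature map with 2 channels, flattened to 4 tokens.
feature_map = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
outputs, weights = global_self_attention(feature_map)
```

Every row of `weights` is strictly positive and sums to 1, so each output position mixes information from the whole feature map. A convolution, by contrast, mixes only the positions inside its kernel window.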