YOLO-Former: YOLO Shakes Hand With ViT

Javad Khoramdel, Ahmad Moori, Yasamin Borhani, Armin Ghanbarzadeh, Esmaeil Najafi
2024-01-12
Abstract: The proposed YOLO-Former method seamlessly integrates the ideas of transformer and YOLOv4 to create a highly accurate and efficient object detection system. The method leverages the fast inference speed of YOLOv4 and incorporates the advantages of the transformer architecture through the integration of convolutional attention and transformer modules. The results demonstrate the effectiveness of the proposed approach, with a mean average precision (mAP) of 85.76% on the Pascal VOC dataset, while maintaining high prediction speed with a frame rate of 10.85 frames per second. The contribution of this work lies in the demonstration of how the innovative combination of these two state-of-the-art techniques can lead to further improvements in the field of object detection.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper proposes YOLO-Former, an object detection method that combines the fast inference of YOLOv4 with the benefits of the Transformer architecture, aiming for a detector that is both efficient and accurate. The authors pursue this goal through the following means:

1. **Innovative Integration**: Combining the ideas of the Transformer with YOLOv4, keeping YOLOv4's fast inference while adding convolutional attention and Transformer modules.
2. **Novel Attention Mechanism**: Developing a new Convolutional Self-Attention Module (CSAM) based on Scaled Dot-Product Self-Attention (SDSA) and integrating it into the YOLOv4 structure.
3. **Enhanced Transformer Module**: Designing a Convolutional Transformer Module to replace the residual blocks in YOLOv4, retaining the residual connections while enabling the network to learn to focus on regions of interest.
4. **Data Augmentation and Regularization**: Employing data augmentation strategies (such as RandAugment and AugMix) and regularization techniques (such as Scheduled DropBlock and Shake-Shake) to improve the model's generalization ability.
5. **Experimental Validation**: Conducting experiments on the Pascal VOC dataset to demonstrate the effectiveness of YOLO-Former, which reaches a mean average precision (mAP) of 85.76% at a prediction speed of 10.85 frames per second.

In summary, the study aims to show how combining two state-of-the-art techniques, YOLOv4 and the Transformer, can further improve performance in object detection.
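To make the attention idea in points 2 and 3 concrete, the following is a minimal sketch of scaled dot-product self-attention applied to a small feature map, followed by a residual connection. It is not the authors' CSAM or Convolutional Transformer Module: the function names, the use of plain projection matrices (standing in for whatever convolutional projections the paper uses), and the toy feature map are all assumptions made for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain-Python matrix product of two nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def scaled_dot_product_self_attention(tokens, w_q, w_k, w_v):
    # tokens: N x C; w_q, w_k, w_v: hypothetical projection matrices.
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # N x N pairwise similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def attention_with_residual(feature_map, w_q, w_k, w_v):
    # feature_map: H x W x C nested lists. Each spatial position becomes a token.
    h, w = len(feature_map), len(feature_map[0])
    tokens = [feature_map[i][j] for i in range(h) for j in range(w)]
    attended, weights = scaled_dot_product_self_attention(tokens, w_q, w_k, w_v)
    # Residual connection: add the attended features back onto the input.
    # This requires w_v to map C channels to C channels.
    out_tokens = [[x + a for x, a in zip(t, att)] for t, att in zip(tokens, attended)]
    out = [out_tokens[i * w:(i + 1) * w] for i in range(h)]
    return out, weights

# Toy 2 x 2 feature map with 2 channels; identity projections are an assumption.
identity = [[1.0, 0.0], [0.0, 1.0]]
fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [0.0, 0.0]]]
out, weights = attention_with_residual(fmap, identity, identity, identity)
```

Each row of `weights` describes how strongly one spatial position attends to every other position, which is the mechanism that lets a network emphasize regions of interest. The residual addition keeps the original features available, mirroring the role of the residual connections that the summary says are retained.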