Abstract: Since its introduction, Swin Transformer has achieved remarkable results in computer vision, which has created a need for dedicated hardware accelerators, particularly for edge computing. Because of their flexibility and low power consumption, FPGAs have been widely used to accelerate inference for convolutional neural networks (CNNs) and show potential for Transformer-based models. Unlike CNNs, which mainly involve multiply-accumulate (MAC) operations, Transformers also involve nonlinear computations such as Layer Normalization (LN), Softmax, and GELU, and these nonlinear computations pose challenges for accelerator design. In this paper, we propose an efficient FPGA-based hardware accelerator for Swin Transformer, focusing on strategies for handling these nonlinear computations and on processing MAC computations efficiently. We replace LN with Batch Normalization (BN), since BN can be fused with linear layers during inference and so adds no separate normalization step at run time. The modified Swin-T, Swin-S, and Swin-B achieve Top-1 accuracies of 80.7%, 82.7%, and 82.8% on ImageNet, respectively. Furthermore, we use approximate computation to design hardware-friendly architectures for Softmax and GELU. We also design an efficient Matrix Multiplication Unit to handle all linear computations in Swin Transformer. In summary, compared with a CPU (AMD Ryzen 5700X), our accelerator achieves 1.76x, 1.66x, and 1.25x speedup and 20.45x, 18.60x, and 14.63x energy efficiency (FPS per unit of power consumption) improvement on the Swin-T, Swin-S, and Swin-B models, respectively. Compared with a GPU (Nvidia RTX 2080 Ti), it achieves 5.05x, 4.42x, and 3.00x energy efficiency improvement, respectively. To the best of our knowledge, the proposed accelerator is the fastest FPGA-based accelerator for Swin Transformer.
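The claim that BN can be fused with linear layers at inference can be illustrated with a minimal sketch. It assumes BN is applied directly to the output of a linear layer and uses fixed running statistics; the abstract does not specify the exact layer ordering or the fusion procedure used in the accelerator, so the function names (`linear`, `batch_norm`, `fuse_linear_bn`) and the toy values are purely illustrative.

```python
import math

def linear(W, b, x):
    # z = W x + b for a single input vector x
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def batch_norm(z, gamma, beta, mean, var, eps=1e-5):
    # Inference-time BN: per-channel affine transform with fixed statistics
    return [g * (zi - m) / math.sqrt(v + eps) + bt
            for zi, g, bt, m, v in zip(z, gamma, beta, mean, var)]

def fuse_linear_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    # Fold BN into the preceding linear layer (hypothetical helper):
    #   scale = gamma / sqrt(var + eps)
    #   W' = diag(scale) W,  b' = scale * (b - mean) + beta
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    W_fused = [[s * w for w in row] for row, s in zip(W, scale)]
    b_fused = [s * (bi - m) + bt for s, bi, m, bt in zip(scale, b, mean, beta)]
    return W_fused, b_fused

# Toy values chosen only for illustration (not taken from the paper)
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0, -1.0]

reference = batch_norm(linear(W, b, x), gamma, beta, mean, var)
W_fused, b_fused = fuse_linear_bn(W, b, gamma, beta, mean, var)
fused = linear(W_fused, b_fused, x)

# The fused layer should match linear followed by BN up to rounding error
max_abs_diff = max(abs(r - f) for r, f in zip(reference, fused))
```

Because the fused weights are computed once offline, the normalization costs nothing extra at run time. LN cannot be folded this way, since its mean and variance depend on each input.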

Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Swin Transformer with Local Aggregation

Beyond Fixation: Dynamic Window Visual Transformer

SWAT: an Efficient Swin Transformer Accelerator Based on FPGA

SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection

Energy Consumption Optimization of Swin Transformer Based on Local Aggregation and Group-Wise Transformation

SwinFG: A fine-grained recognition scheme based on swin transformer

Gaze Estimation Based on Convolutional Structure and Sliding Window-Based Attention Mechanism

SwinVI: 3D Swin Transformer Model with U-net for Video Inpainting

S-Swin Transformer: simplified Swin Transformer model for offline handwritten Chinese character recognition

An Efficient FPGA-Based Accelerator for Swin Transformer

Cas-VSwin transformer: A variant swin transformer for surface-defect detection

SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation

Conv-Attention: A Low Computation Attention Calculation Method for Swin Transformer

Swin Transformer for Fast MRI