Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators Through Attention Fusion

Yunji Qin, Wenqi Lou, Chao Wang, Lei Gong, Xuehai Zhou
DOI: https://doi.org/10.1145/3649476.3658810
2024-01-01
Abstract: Attention-based Transformers have achieved significant performance breakthroughs in natural language processing (NLP) and computer vision (CV) tasks. Meanwhile, the ever-increasing length of today's input sequences puts much pressure on computing devices. FPGAs are widely used to accelerate Transformer inference due to their high energy efficiency and flexibility. However, most existing FPGA-based Transformer accelerators are oriented to small input lengths, making it hard to accelerate long input sequences. To this end, we design an efficient Transformer accelerator for FPGA and long-sequence input scenarios. We use the tiling softmax algorithm to fuse attention computation, eliminating the memory and bandwidth bottleneck in the attention layer and allowing our accelerator to support arbitrary input sequence lengths. We use BERT-Base on the Alveo U50 board for evaluation, and our implementation achieves computational efficiency improvements of 1.09x to 2.48x over prior FPGA accelerators. Besides, our accelerator can support up to 175K input sequence length when running BERT-like structures, far more than previous designs.
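The abstract does not spell out the tiling softmax algorithm, so the following is only a minimal software sketch of the general idea, assuming it resembles the widely used online-softmax formulation: keys and values are consumed one tile at a time, and only a running maximum, a running sum, and an output accumulator are kept, so the full score row is never stored. The function names, the `tile` parameter, and the single-query layout are hypothetical choices for illustration and are not taken from the paper or its hardware design. Inputs are assumed to be non-empty.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize every score, apply softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, tile=4):
    """Fused attention for one query, processing keys/values tile by tile.

    Only a running max, a running sum, and an output accumulator are kept,
    so memory use does not grow with the sequence length.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because each tile only needs the running statistics from earlier tiles, the working set depends on the tile size rather than the sequence length, which is the property that would let a fused design avoid buffering the full attention matrix. Up to floating-point rounding, the tiled result matches the naive reference.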