Fitop-Trans: Maximizing Transformer Pipeline Efficiency Through Fixed-Length Token Pruning on FPGA

Kejia Shi, Manting Zhang, Keqing Zhao, Xiaoxing Wu, Yang Liu, Jun Yu, Kun Wang
DOI: https://doi.org/10.1109/fpl64840.2024.00041
2024-01-01
Abstract: Recent years have witnessed Transformers emerge as a groundbreaking innovation in the Natural Language Processing (NLP) field. Unlike Recurrent Neural Network (RNN) models, Transformers process sequences in parallel, boosting accuracy for longer sequences. However, Transformers suffer from long processing times, largely because inputs must be padded to match the longest sentence in a batch, which increases the computational load. In this paper, we present Fitop-Trans, the first algorithm-hardware co-optimized framework that uses a Fixed-Length Token Pruning strategy when deploying Transformers on FPGA. At the algorithmic level, we propose Fixed-Length Token Pruning, a novel pruning method that eliminates unimportant tokens before the first layer in order to maximize hardware efficiency in attention computation. On the hardware side, a token selector is designed for Fixed-Length Token Pruning, which minimizes off-chip memory traffic. In addition, a partitionable Systolic Array (SA) is adopted, which can handle varying input lengths while maximizing Digital Signal Processor (DSP) resource utilization. Furthermore, a scheduling module is designed to optimize hardware resource allocation and enhance pipelined attention throughput. Experimental results show that our FPGA hardware design achieves latency speedups of $580 \times$ and $6.39 \times$ over an Intel Xeon Gold CPU and an NVIDIA GeForce RTX 3090 GPU, respectively.
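To make the idea of pruning to a fixed length concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how token importance is computed, so the per-token `scores` are a hypothetical input. The function name `prune_to_fixed_length`, the target length `k`, and the choice to pad sequences shorter than `k` are likewise assumptions made for illustration.

```python
from typing import List, Sequence


def prune_to_fixed_length(
    tokens: Sequence[str],
    scores: Sequence[float],
    k: int,
    pad_token: str = "[PAD]",
) -> List[str]:
    """Return exactly k tokens: the k highest-scoring ones in original order.

    Assumption: `scores` holds one importance value per token; how such
    scores are obtained is not specified in the abstract.
    Assumption: sequences shorter than k are padded up to k.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if k < 0:
        raise ValueError("k must be non-negative")

    # Rank positions by descending score; ties go to the earlier position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Keep the top-k positions, restored to their original order.
    keep = sorted(ranked[:k])
    pruned = [tokens[i] for i in keep]
    # Pad so that every sequence leaves this step with length k.
    return pruned + [pad_token] * (k - len(pruned))


batch = [
    (["the", "cat", "sat", "on", "the", "mat"], [0.1, 0.9, 0.8, 0.2, 0.1, 0.7]),
    (["hi", "there"], [0.5, 0.4]),
]
fixed = [prune_to_fixed_length(toks, sc, k=4) for toks, sc in batch]
```

Under these assumptions every sequence in `fixed` has length 4, so the attention score matrix for each sequence is 4 x 4 instead of 6 x 6 when padding to the longest sentence in the batch. A constant sequence length is also what would let downstream hardware be sized once rather than per batch.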