LTrans-OPU: A Low-Latency FPGA-Based Overlay Processor for Transformer Networks

Yueyin Bai, Hao Zhou, Keqing Zhao, Manting Zhang, Jianli Chen, Jun Yu, Kun Wang
DOI: https://doi.org/10.1109/fpl60245.2023.00048
2023-01-01
Abstract: Existing field-programmable gate array (FPGA) accelerators for transformer networks either focus only on attention computation or suffer from fixed, inflexible data streams. Moreover, compression and approximation methods for transformer networks leave room for further optimization. In this article, we propose a low-latency FPGA-based overlay processor, named LTrans-OPU, for general acceleration of transformer networks. Specifically, we design a domain-specific overlay architecture that includes a computation unit for matrix multiplication of arbitrary dimensions. We also introduce an instruction set customized for this overlay architecture, in which generated instructions dynamically control the data flows. In addition, we introduce a hybrid pruning method common to various transformer networks, along with an efficient non-linear function approximation method. Experimental results show that our design is competitive and achieves low latency. LTrans-OPU achieves 11.10-32.20× speedup compared with CPU and 2.44-6.18× latency reduction compared with GPU. It also shows 2.36-12.43× lower latency than customized FPGA/ASIC accelerators and can be 3.10× faster than NPE.