FET-OPU: A Flexible and Efficient FPGA-Based Overlay Processor for Transformer Networks

Yueyin Bai, Hao Zhou, Keqing Zhao, Hongji Wang, Jianli Chen, Jun Yu, Kun Wang
DOI: https://doi.org/10.1109/iccad57390.2023.10323752
2023-01-01
Abstract: Several existing works accelerate transformer networks on field-programmable gate arrays (FPGAs). However, many of these accelerators focus only on attention computation or suffer from fixed, inflexible data streams. Moreover, their hardware performance is limited because they lack schedule optimization and do not make full use of hardware resources. In this article, we propose a flexible and efficient FPGA-based overlay processor, named FET-OPU. Specifically, we design an overlay architecture for general acceleration of transformer networks. We propose a unique matrix multiplication unit (MMU), which consists of a processing element (PE) array based on a modified DSP-packing technique and a FIFO array for data caching and rearrangement. We also introduce an efficient non-linear function unit (NFU), which can calculate arbitrary single-input non-linear functions. We further customize an instruction set for our overlay architecture, so that data flows are controlled dynamically by instructions generated on the software side. In addition, we introduce a two-level compiler and optimize the parallelism and memory allocation schedule. Experimental results show that FET-OPU achieves 7.33-21.27× speedup and 231× less energy consumption compared with a CPU, and 1.56-4.08× latency reduction with 5.85-66.36× less energy consumption compared with a GPU. Furthermore, we observe 1.56-8.21× better latency and 5.28-6.24× less energy consumption compared with previous customized FPGA/ASIC accelerators, and FET-OPU can be 2.05× faster than NPE with 5.55× less energy consumption.
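The abstract does not describe how the "modified DSP-packing" scheme works, so the following is only a minimal sketch of the generic, widely used idea behind DSP packing: two narrow multiplications that share one operand are computed with a single wide multiplication. The function name `packed_mul`, the signed 8-bit operand width, and the 18-bit shift are illustrative assumptions, not details taken from the paper.

```python
def packed_mul(a: int, b: int, c: int, shift: int = 18) -> tuple[int, int]:
    """Compute (a*c, b*c) with one wide multiplication.

    Assumption: a, b, c are signed 8-bit integers, so b*c fits in a signed
    `shift`-bit field and the two partial products do not overlap.
    This is generic DSP packing, not the paper's modified scheme.
    """
    for v in (a, b, c):
        if not -128 <= v <= 127:
            raise ValueError("operands must be signed 8-bit integers")

    # Pack a into the upper bits and b into the lower bits of one operand.
    packed = (a << shift) + b

    # A single wide multiply yields (a*c << shift) + b*c.
    prod = packed * c

    # Extract the low field and sign-extend it to recover b*c.
    low = prod & ((1 << shift) - 1)
    if low >= 1 << (shift - 1):
        low -= 1 << shift

    # Remove the low product, then shift down to recover a*c.
    high = (prod - low) >> shift
    return high, low


if __name__ == "__main__":
    print(packed_mul(-7, 5, 3))
```

On a real FPGA the same arithmetic maps onto one DSP slice with a wide multiplier input; the sign-correction step corresponds to the small amount of extra logic that packing schemes need around the DSP.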