LORA: A Latency-Oriented Recurrent Architecture for GPT Model on Multi-FPGA Platform with Communication Optimization

Zhendong Zheng, Qianyu Cheng, Teng Wang, Lei Gong, Xianglan Chen, Cheng Tang, Chao Wang, Xuehai Zhou
DOI: https://doi.org/10.1109/fpl64840.2024.00053
2024-01-01
Abstract: Large Language Models (LLMs), most notably the Generative Pre-trained Transformer (GPT), are widely deployed in data centers to provide various services. The GPT model has heavy memory and computing overhead, and its inference process has two stages with distinct computing characteristics: Prefill and Decode. Platforms that deploy GPT in data centers on existing GPUs and FPGA accelerators still need more effective synchronization schemes or structures with higher computational intensity. This paper proposes LORA, a low-latency end-to-end GPT acceleration platform built on multiple FPGAs. First, we optimize the synchronization timing of the GPT model to reduce computation and communication overhead. Second, we devise efficient synchronization steps for specific layers of the GPT model that overlap part of the computation and communication delay, reducing the latency of our platform. Finally, we deploy recurrent structures on each FPGA to accelerate the different stages of the GPT model. Implemented on Xilinx Alveo U280 FPGAs, LORA achieves an average $11.1\times$ speedup over NVIDIA V100 GPUs on the modern GPT-2 model. Compared to the existing multi-FPGA accelerator appliance, LORA shows performance improvements of up to $4\times$ and $2.7\times$ in the Prefill and Decode stages, respectively.