Ph.D. Project: Achieving Low-Latency Acceleration on Multi-FPGA for GPT Application

Zhendong Zheng, Teng Wang, Chao Wang
DOI: https://doi.org/10.1109/fccm60383.2024.00050
2024-01-01
Abstract: This paper proposes a latency-oriented recurrent architecture for GPT on multi-FPGA with communication optimization. We devise an efficient communication scheme that overlaps part of the computation and communication delay to improve the latency and scalability of our platform. Then, we deploy recurrent structures on each FPGA to accelerate the different phases of GPT. A preliminary experiment shows that our method can reduce the synchronization overhead and increase the computing intensity, resulting in an average 11.8x speedup over an NVIDIA V100 GPU and a 3.0x speedup over an existing multi-FPGA accelerator appliance on the GPT-2 model.
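To make the overlap idea in the abstract concrete, the following is a minimal sketch of a toy latency model, not the authors' scheme. It assumes each step consists of a compute phase followed by a synchronization phase, and that some fraction of the communication can be hidden behind computation. The function names, the `overlap_fraction` parameter, and all timing values are hypothetical and chosen only for illustration.

```python
def step_latency(compute_us, comm_us, overlap_fraction):
    """Latency of one compute-then-synchronize step (toy model).

    overlap_fraction is the assumed share of communication time that runs
    concurrently with computation; hidden time cannot exceed compute time.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    hidden = min(comm_us * overlap_fraction, compute_us)
    return compute_us + comm_us - hidden


def total_latency(steps, overlap_fraction):
    """Sum the step latencies for a list of (compute_us, comm_us) pairs."""
    return sum(step_latency(c, m, overlap_fraction) for c, m in steps)


if __name__ == "__main__":
    # Hypothetical numbers: four steps, 100 us compute and 40 us sync each.
    steps = [(100.0, 40.0)] * 4
    for frac in (0.0, 0.5, 1.0):
        print(f"overlap={frac:.1f}: {total_latency(steps, frac):.1f} us")
```

In this model, hiding half of the communication cuts the four-step total from 560 us to 480 us, and hiding all of it leaves only the 400 us of compute. The real benefit depends on how much communication the hardware can actually overlap, which the abstract does not quantify.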