FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang
2024-01-09
Abstract: Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between LLMs' computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads.
Hardware Architecture, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the high computational and memory overheads of large language model (LLM) inference. Specifically:

1. **Low computational efficiency**: Existing hardware platforms such as GPUs handle compressed LLMs poorly, especially models with unstructured sparsity patterns, so much of the available compute goes unused.
2. **Low memory bandwidth utilization**: During the decoding stage, LLMs repeatedly fetch fine-grained data from off-chip memory, which leaves memory bandwidth utilization at only 29-43%.
3. **High compilation overhead**: The dynamic sparsity patterns and variable input lengths of LLMs create a large design space, so the generated instruction files need a huge amount of storage (for example, instruction files for an input token length of 2048 require roughly terabyte-level storage on an FPGA).

To solve these problems, the paper proposes **FlightLLM**, an FPGA-based framework for efficient LLM inference. It improves computational and memory efficiency through three techniques:

1. **Configurable sparse DSP chain**: Supports different sparsity patterns to improve computational efficiency.
2. **Always-on-chip decode scheme**: Uses mixed-precision support to reduce off-chip memory accesses and improve memory bandwidth utilization.
3. **Length-adaptive compilation method**: Reduces compilation overhead so that real-world LLMs can be deployed on FPGAs.

With these techniques, FlightLLM on a Xilinx Alveo U280 FPGA achieves higher energy efficiency and cost efficiency than commercial GPUs such as the NVIDIA V100S, and on the latest Versal VHK158 FPGA it achieves higher throughput than an NVIDIA A100 GPU.
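The compilation-overhead point can be made concrete with a small counting sketch. If a separate instruction stream is compiled for every possible input length, the number of variants grows with the maximum length. One generic way to shrink that space is to round each length up to a bucket boundary so that nearby lengths share one compiled variant. The bucketing rule, the function names `bucket_length` and `count_instruction_variants`, and the bucket size of 16 are all assumptions made for illustration; this summary does not state that FlightLLM uses exactly this mechanism or these values.

```python
def bucket_length(token_len: int, bucket_size: int = 16) -> int:
    """Round token_len up to the nearest multiple of bucket_size.

    Hypothetical helper: the bucket size is an illustrative choice,
    not a value taken from the paper.
    """
    if token_len <= 0:
        raise ValueError("token_len must be positive")
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return ((token_len + bucket_size - 1) // bucket_size) * bucket_size


def count_instruction_variants(max_len: int, bucket_size: int) -> int:
    """Count the distinct compiled variants needed to cover lengths 1..max_len."""
    return len({bucket_length(n, bucket_size) for n in range(1, max_len + 1)})


if __name__ == "__main__":
    # One variant per length versus one variant per bucket of 16 lengths.
    print(count_instruction_variants(2048, 1))   # 2048
    print(count_instruction_variants(2048, 16))  # 128
```

Under these assumptions, covering input lengths up to 2048 takes 128 variants instead of 2048. The trade-off is that a request shorter than its bucket boundary is processed as if it were padded to that boundary, so larger buckets save storage at the cost of some wasted work.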