LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs

Han Xu, Yutong Li, Shihao Ji
2024-09-13
Abstract: Large language models (LLMs) have demonstrated remarkable abilities in natural language processing. However, their deployment on resource-constrained embedded devices remains difficult due to memory and computational demands. In this paper, we present an FPGA-based accelerator designed to improve LLM inference performance on embedded FPGAs. We employ post-training quantization to reduce model size and optimize for off-chip memory bandwidth. Our design features asynchronous computation and a fully pipelined accelerator for matrix-vector multiplication. Experiments with the TinyLlama 1.1B model on a Xilinx ZCU102 platform show a 14.3-15.8x speedup and a 6.1x power efficiency improvement over running exclusively on the ZCU102 processing system (PS).
Hardware Architecture
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses the challenge of deploying large language models (LLMs) on resource-constrained embedded devices. Although LLMs perform well on natural language processing (NLP) tasks, their memory and compute requirements make them difficult to deploy on Internet of Things (IoT) devices. The paper proposes LlamaF, an FPGA-based accelerator design that improves LLM inference performance on embedded FPGAs. Its main contributions are:

1. **Quantization Strategy**: Post-training quantization reduces the model size and makes better use of off-chip memory bandwidth.
2. **Fully Pipelined Accelerator**: A fully pipelined accelerator performs group quantized matrix-vector multiplication (GQMV).
3. **Asynchronous Computation**: The FPGA computes asynchronously while weights are being transferred, so computation overlaps with weight transfer and performance improves significantly.
4. **Experimental Validation**: On the Xilinx ZCU102 platform with the TinyLlama 1.1B model, LlamaF achieves a 14.3x to 15.8x speedup and a 6.1x improvement in power efficiency compared with running only on the ZCU102 processing system (PS).

This is described as the first work to accelerate the Llama2 architecture on embedded FPGAs.
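To make the group quantized matrix-vector multiplication (GQMV) named above more concrete, here is a minimal software sketch of the arithmetic only. It is not the paper's hardware design, and it does not model pipelining or asynchronous weight transfer. The symmetric int8 scheme, the one-scale-per-group layout, the tiny group size, and the function names `quantize_groups` and `gqmv` are all assumptions chosen for illustration.

```python
from typing import List, Tuple

GROUP_SIZE = 4  # assumed for illustration only; not taken from the paper


def quantize_groups(values: List[float],
                    group_size: int = GROUP_SIZE) -> Tuple[List[int], List[float]]:
    """Symmetric int8 quantization with one float scale per group (assumed scheme)."""
    if len(values) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # Map the largest magnitude in the group to 127; avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(-127, min(127, round(v / scale))) for v in group)
    return quantized, scales


def gqmv(w_q: List[int], w_scales: List[float],
         x_q: List[int], x_scales: List[float],
         rows: int, cols: int, group_size: int = GROUP_SIZE) -> List[float]:
    """Compute y = W @ x for a row-major, group-quantized W and group-quantized x."""
    groups_per_row = cols // group_size
    y: List[float] = []
    for r in range(rows):
        acc = 0.0
        for g in range(groups_per_row):
            base = r * cols + g * group_size
            # Integer multiply-accumulate inside one group.
            int_sum = 0
            for k in range(group_size):
                int_sum += w_q[base + k] * x_q[g * group_size + k]
            # Rescale the integer partial sum with the weight and activation scales.
            acc += int_sum * w_scales[r * groups_per_row + g] * x_scales[g]
        y.append(acc)
    return y


if __name__ == "__main__":
    rows, cols = 2, 8
    w = [0.5, -0.25, 0.75, 1.0, -1.0, 0.125, 0.3, -0.6,
         0.1, 0.2, -0.3, 0.4, 0.9, -0.8, 0.7, -0.05]
    x = [1.0, -0.5, 0.25, 0.75, -0.1, 0.6, -0.9, 0.33]
    w_q, w_s = quantize_groups(w)
    x_q, x_s = quantize_groups(x)
    print(gqmv(w_q, w_s, x_q, x_s, rows, cols))
```

In this sketch, each group contributes an integer dot product followed by a single floating-point rescale. That structure keeps the inner loop in integer arithmetic and the stored weights at one byte each plus a per-group scale.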