Hardware Acceleration of LLMs: A comprehensive survey and comparison

Nikoletta Koilia, Christoforos Kachris
2024-09-05
Abstract: Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. In this paper, we present a comprehensive survey of the research efforts on accelerating transformer networks for Large Language Models using hardware accelerators. The survey presents the frameworks that have been proposed and then performs a qualitative and quantitative comparison regarding the technology, the processing platform (FPGA, ASIC, In-Memory, GPU), the speedup, the performance (GOPs), and the energy efficiency (GOPs/W) of each framework. The main challenge in the comparison is that every proposed scheme is implemented on a different process technology, which makes a fair comparison hard. The main contribution of this paper is that we extrapolate the performance and energy-efficiency results to the same technology in two ways, one theoretical and one more practical, to make a fair comparison. We implement part of the LLMs on several FPGA chips to extrapolate the results to the same process technology and then we make a fair comparison of the performance.
Hardware Architecture, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is: **how to effectively improve the performance and energy efficiency of large language models (LLMs) using hardware accelerators**. Specifically, this survey focuses on using hardware accelerators (FPGAs, ASICs, in-memory computing architectures, and GPUs) to accelerate transformer networks and so meet the high computational demands of LLMs in natural language processing (NLP) tasks. The main problems are:

1. **Performance improvement**: How to significantly increase the inference and training speed of large language models through hardware accelerators.
2. **Energy-efficiency optimization**: How to reduce energy consumption while improving performance, so that these models can run efficiently in resource-constrained environments.
3. **Fair comparison**: Because different studies use different process technologies, a fair comparison of performance and energy efficiency is difficult. How to evaluate the different schemes as if they were built on the same process technology is therefore also an important issue.

### Main contributions

To address the above problems, the main contributions of this paper are:

- **Comprehensive survey**: An extensive survey of existing hardware acceleration schemes, covering frameworks based on FPGAs, ASICs, in-memory computing architectures, and GPUs.
- **Performance and energy-efficiency comparison**: Qualitative and quantitative comparisons of the frameworks along several dimensions: technology, processing platform, speedup, performance (GOPs), and energy efficiency (GOPs/W).
- **Fair performance and energy-efficiency evaluation**: Part of an LLM is implemented on multiple FPGA chips, and the results are extrapolated to the same technology node to obtain a fair comparison of performance and energy efficiency.

### Solution overview

The paper details a variety of hardware acceleration schemes, including but not limited to:

- **FTRANS**: Compresses the model using block-circulant matrices (BCM), achieving 16-fold compression while maintaining high accuracy.
- **Multi-Head Attention acceleration**: A dedicated hardware accelerator for the two most computationally intensive parts of the transformer network: the multi-head attention (MHA) mechanism and the position-wise feed-forward network (FFN).
- **Compressed Block Row (CBR)**: An efficient sparse-matrix storage structure that combines algorithm-level balanced model compression with a hardware-level optimized design.
- **ViA**: An FPGA acceleration architecture for the Vision Transformer (ViT) that significantly improves speed and energy efficiency.
- **FlexRun**: An FPGA accelerator that supports multiple complex NLP models (such as RNN, LSTM, Transformer, and GPT2), with a significant speedup over existing schemes.

In covering these methods, the paper not only summarizes current research progress but also provides references and directions for future research.
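
To make the block-circulant matrix (BCM) compression mentioned for FTRANS above more concrete, the following is a minimal NumPy sketch of the general technique, not FTRANS's actual implementation. It assumes each `b x b` block of a weight matrix is circulant and is stored only as its first column, so a block needs `b` values instead of `b * b`, and the matrix-vector product can be computed with FFTs. The function names, the `(p, q, b)` storage layout, and the choice of first-column storage are all assumptions made for illustration.

```python
import numpy as np


def expand_block_circulant(first_cols):
    """Build the dense weight matrix from per-block first columns.

    first_cols has shape (p, q, b): a p x q grid of b x b circulant blocks,
    each represented by its first column (an illustrative layout).
    """
    p, q, b = first_cols.shape
    # idx[r, s] = (r - s) mod b, so block[r, s] = c[(r - s) mod b]
    idx = (np.arange(b)[:, None] - np.arange(b)[None, :]) % b
    dense = np.zeros((p * b, q * b))
    for i in range(p):
        for j in range(q):
            dense[i * b:(i + 1) * b, j * b:(j + 1) * b] = first_cols[i, j][idx]
    return dense


def block_circulant_matvec(first_cols, x):
    """Compute W @ x from the compressed form, without materialising W.

    A circulant block times a vector is a circular convolution, which is an
    element-wise product in the frequency domain.
    """
    p, q, b = first_cols.shape
    x_blocks = x.reshape(q, b)
    c_hat = np.fft.rfft(first_cols, axis=-1)  # shape (p, q, b // 2 + 1)
    x_hat = np.fft.rfft(x_blocks, axis=-1)    # shape (q, b // 2 + 1)
    # Sum the per-block products over the q input blocks.
    y_hat = np.einsum("pqk,qk->pk", c_hat, x_hat)
    return np.fft.irfft(y_hat, n=b, axis=-1).reshape(p * b)


def compression_ratio(p, q, b):
    """Dense parameter count divided by block-circulant parameter count."""
    return (p * b * q * b) / (p * q * b)
```

Under this layout the storage reduction for a compressed layer equals the block size, so a block size of 16 would correspond to a 16-fold reduction for that layer. Whether the 16-fold figure quoted above comes from exactly this configuration is not stated in this summary.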