Abstract: Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. In this paper, we present a comprehensive survey of the research efforts on accelerating transformer networks for Large Language Models using hardware accelerators. The survey presents the frameworks that have been proposed and then performs a qualitative and quantitative comparison regarding the technology, the processing platform (FPGA, ASIC, In-Memory, GPU), the speedup, the energy efficiency, the performance (GOPs), and the energy efficiency (GOPs/W) of each framework. The main challenge in the comparison is that every proposed scheme is implemented on a different process technology, which makes a fair comparison difficult. The main contribution of this paper is that we extrapolate the performance and energy-efficiency results to the same technology in order to make a fair comparison, one theoretical and one more practical. We implement part of the LLMs on several FPGA chips to extrapolate the results to the same process technology, and then we make a fair comparison of the performance.
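
The idea of extrapolating results to a common process technology can be illustrated with a small sketch. The abstract does not state the scaling rule the authors use, so the example assumes a simple one: throughput scales with the ratio of feature sizes raised to one exponent, and energy efficiency with the same ratio raised to another. The `Accelerator` type, the `normalize` function, the exponents, the 16 nm target, and the sample figures are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str        # hypothetical label, not a framework from the survey
    node_nm: float   # process technology the design was reported on
    gops: float      # reported throughput in GOPs
    watts: float     # reported power in W


def normalize(acc, target_nm=16.0, perf_exp=1.0, eff_exp=2.0):
    """Scale reported figures to a common technology node.

    ASSUMPTION: throughput scales with (node / target) ** perf_exp and
    energy efficiency with (node / target) ** eff_exp. These exponents
    are placeholders, not the rule used in the paper.
    """
    ratio = acc.node_nm / target_nm
    gops = acc.gops * ratio ** perf_exp
    gops_per_watt = (acc.gops / acc.watts) * ratio ** eff_exp
    return gops, gops_per_watt


# Made-up example designs reported on two different nodes.
designs = [
    Accelerator("design_a", node_nm=28.0, gops=100.0, watts=10.0),
    Accelerator("design_b", node_nm=16.0, gops=150.0, watts=12.0),
]

for d in designs:
    gops, eff = normalize(d)
    print(f"{d.name}: {gops:.1f} GOPs, {eff:.2f} GOPs/W (normalized)")
```

A design already reported on the target node is left unchanged, while a design on an older node is credited with the gain it would be expected to see after scaling. Changing the exponents shows how sensitive any ranking is to the assumed scaling rule.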

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **How to effectively improve the performance and energy efficiency of large-scale language models (LLMs) through hardware accelerators**. Specifically, this review article focuses on using hardware accelerators (such as FPGA, ASIC, in-memory computing architectures, and GPU) to accelerate Transformer networks to meet the high computational demands of large-scale language models in natural language processing (NLP) tasks. The main problems include:

1. **Performance improvement**: How to significantly increase the inference and training speed of large language models through hardware accelerators.
2. **Energy-efficiency optimization**: How to reduce energy consumption while improving performance, enabling these models to operate efficiently in resource-constrained environments.
3. **Fair comparison**: Because different studies use different process technologies, it is difficult to conduct a fair performance and energy-efficiency comparison. Therefore, how to reasonably evaluate the performance and energy efficiency of different schemes under the same process technology is also an important issue.

### Main contributions

To address the above problems, the main contributions of this paper include:

- **Comprehensive survey**: Conducted an extensive survey of existing hardware acceleration schemes, covering various frameworks based on FPGA, ASIC, in-memory computing architectures, and GPU.
- **Performance and energy-efficiency comparison**: Conducted qualitative and quantitative comparisons of different frameworks along multiple dimensions (such as technology, processing platform, speedup ratio, energy efficiency, performance (GOPs), and energy efficiency (GOPs/W)).
- **Fair performance and energy-efficiency evaluation**: By implementing part of the LLM on multiple FPGA chips and extrapolating the results to the same technology node, a fair performance and energy-efficiency comparison is achieved.

### Solution overview

The paper details a variety of hardware acceleration schemes, including but not limited to:

- **FTRANS**: Compresses the model through block-circulant matrices (BCM), achieving a 16-fold compression rate while maintaining high accuracy.
- **Multi-Head Attention acceleration**: Proposed a dedicated hardware accelerator for the most computationally intensive parts of the Transformer network: the multi-head attention mechanism (MHA) and the position-wise feed-forward network (FFN).
- **Compressed Block Row (CBR)**: Proposed an effective sparse matrix storage structure by combining algorithm-level balanced model compression and hardware-level optimized design.
- **ViA**: An FPGA acceleration architecture proposed for Vision Transformer (ViT), which significantly improves speed and energy efficiency.
- **FlexRun**: An FPGA accelerator that supports multiple complex NLP models (such as RNN, LSTM, Transformer, GPT2), with a significant speed improvement compared to existing schemes.

Through these methods, this paper not only summarizes the current research progress but also provides important references and directions for future research.
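
The block-circulant compression mentioned for FTRANS can be sketched in a few lines. A circulant block of size b is fully determined by its first column, so it stores b values instead of b × b, which is a b-fold reduction per block. The sketch below is a generic illustration of that idea, not the FTRANS implementation: the function names and the toy values are hypothetical, and it uses a direct sum for clarity, whereas block-circulant designs commonly compute the product with FFTs.

```python
def circulant_matvec(first_col, x):
    """Multiply the circulant matrix defined by its first column with x."""
    b = len(first_col)
    # Entry (i, j) of a circulant matrix is first_col[(i - j) % b].
    return [sum(first_col[(i - j) % b] * x[j] for j in range(b))
            for i in range(b)]


def block_circulant_matvec(blocks, x, b):
    """blocks[r][c] holds the b-element first column of block (r, c)."""
    y = []
    for row in blocks:
        acc = [0.0] * b
        for c, first_col in enumerate(row):
            part = circulant_matvec(first_col, x[c * b:(c + 1) * b])
            acc = [a + p for a, p in zip(acc, part)]
        y.extend(acc)
    return y


def compression_ratio(b):
    # A dense b x b block stores b * b weights; the circulant form stores b.
    return (b * b) // b


# A 4 x 4 weight matrix built from 2 x 2 circulant blocks (toy values).
blocks = [[[1.0, 2.0], [0.0, 1.0]],
          [[3.0, 0.0], [1.0, 1.0]]]
x = [1.0, 0.0, 2.0, 1.0]
y = block_circulant_matvec(blocks, x, b=2)
print(y, compression_ratio(16))
```

Under this scheme a block size of 16 would give a 16-fold reduction in stored weights per block; whether that is the exact configuration behind the figure quoted above is an assumption.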

Hardware Acceleration of LLMs: A comprehensive survey and comparison

A Survey on Hardware Accelerators for Large Language Models

A Hardware Evaluation Framework for Large Language Model Inference

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

A Comprehensive Performance Study of Large Language Models on Novel AI Accelerators

A Comprehensive Evaluation of FPGA-Based Spatial Acceleration of LLMs

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

New Solutions on LLM Acceleration, Optimization, and Application

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

Hardware-friendly compression and hardware acceleration for transformer: A survey

Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine

Efficient and Economic Large Language Model Inference with Attention Offloading

The Efficiency Spectrum of Large Language Models: An Algorithmic Survey

Accelerating Neural Networks for Large Language Models and Graph Processing with Silicon Photonics

A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition

EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models

Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey