Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs such as BERT and DeBERTa, generative LLMs such as the GPT series and the Llama series are currently the main focus due to their superior algorithmic performance. Advances in generative LLMs are closely intertwined with the development of hardware capabilities, and each hardware platform has distinct characteristics that can be exploited to improve LLM inference performance. This paper therefore comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize the optimization methods for different platforms, namely CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms, considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the overall performance across different hardware platforms, and the performance of different methods on the same hardware platform. By integrating software optimization methods and hardware platforms, this provides a systematic and comprehensive summary of existing inference acceleration work and points to future trends and potential developments of generative LLMs and hardware technology for edge-side scenarios.
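The three comparison metrics named in the abstract are related by simple arithmetic: since one watt is one joule per second, energy efficiency in tokens/J is inference speed in tokens/s divided by power in W. The following is a minimal sketch of that relation, not the survey's own tooling; the function name, the platform labels, and all figures are hypothetical and chosen only for illustration.

```python
def energy_efficiency(tokens_per_second: float, power_watts: float) -> float:
    """Return energy efficiency in tokens/J from speed (tokens/s) and power (W)."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    # 1 W = 1 J/s, so (tokens/s) / (J/s) = tokens/J.
    return tokens_per_second / power_watts


# Hypothetical (speed in tokens/s, power in W) pairs, not measured results.
platforms = {
    "platform_a": (100.0, 50.0),
    "platform_b": (20.0, 5.0),
}

# Rank platforms from most to least energy efficient.
ranking = sorted(
    platforms,
    key=lambda name: energy_efficiency(*platforms[name]),
    reverse=True,
)

for name in ranking:
    speed, power = platforms[name]
    print(f"{name}: {energy_efficiency(speed, power):.2f} tokens/J")
```

With these made-up numbers the slower platform ranks first on tokens/J, which illustrates why absolute speed and energy efficiency are reported as separate metrics.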

Generative Inference of Large Language Models in Edge Computing: An Energy Efficient Approach

Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Networks: An Active Inference Approach

Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks

Decentralized LLM Inference over Edge Networks with Energy Harvesting

Empowering Large Language Models to Edge Intelligence: A Survey of Edge Efficient LLMs and Techniques

Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing

Efficient and Economic Large Language Model Inference with Attention Offloading

Mobile Edge Intelligence for Large Language Models: A Contemporary Survey

Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization

An Empirical Analysis and Resource Footprint Study of Deploying Large Language Models on Edge Devices

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Adaptive Layer Splitting for Wireless LLM Inference in Edge Computing: A Model-Based Reinforcement Learning Approach

Retrieval-Augmented Generation for Mobile Edge Computing via Large Language Model

Delay-Optimal Computation Offloading in Large-Scale Multi-Access Edge Computing Using Mean Field Game

LLMCad: Fast and Scalable On-device Large Language Model Inference

A Review on Edge Large Language Models: Design, Execution, and Applications

Toward Democratized Generative AI in Next-Generation Mobile Edge Networks

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Generative AI on the Edge: Architecture and Performance Evaluation