A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang

2024-07-19

Abstract: Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of the inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Last but not least, we provide some knowledge summary and discuss future research directions.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the difficulty of deploying Large Language Models (LLMs) in resource-constrained scenarios: although LLMs perform well across many tasks, their inference demands substantial computation and memory. It surveys the existing literature on improving LLM inference efficiency and attributes the inefficiency to three main causes: the large model size, the quadratic complexity of the attention operation, and the auto-regressive decoding approach.

The paper first introduces the basic concepts of LLMs and analyzes the efficiency bottlenecks of the inference process in detail. It then proposes a taxonomy that divides related research into three levels. Data-level optimization improves efficiency by optimizing input prompts or organizing output content. Model-level optimization designs efficient model structures or compresses pre-trained models. System-level optimization targets the inference engine or the serving system; it usually does not involve expensive model training and does not degrade model performance.

Finally, the paper runs comparative experiments on representative methods in key subfields to provide quantitative insights, summarizes the resulting knowledge, offers practical suggestions and guidance, and discusses future research directions.
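
The three causes of inefficiency named above can be made concrete with some back-of-the-envelope arithmetic. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the 7-billion-parameter model, the 2-byte (FP16) weights, and the sequence lengths are illustrative assumptions.

```python
# Illustrative arithmetic only; names and numbers are assumptions,
# not taken from the surveyed paper.

def weight_memory_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory needed just to store the weights, in GiB (2 bytes = FP16)."""
    return num_params * bytes_per_param / 2**30


def attention_score_entries(seq_len: int) -> int:
    """Entries in one head's attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


def decoding_forward_passes(num_new_tokens: int) -> int:
    """Plain auto-regressive decoding runs one forward pass per new token."""
    return num_new_tokens


if __name__ == "__main__":
    # Large model size: weights alone for an assumed 7B-parameter model.
    print(f"weights: {weight_memory_gib(7_000_000_000):.2f} GiB")

    # Quadratic attention: doubling the sequence length quadruples the scores.
    ratio = attention_score_entries(2048) / attention_score_entries(1024)
    print(f"score-matrix growth for 2x length: {ratio:.0f}x")

    # Auto-regressive decoding: the passes are sequential, one per token.
    print(f"forward passes for 256 new tokens: {decoding_forward_passes(256)}")
```

Under these assumptions, the sketch shows why each cause matters separately: weight storage scales with parameter count, attention scores scale with the square of the sequence length, and the number of sequential decoding steps scales with the output length.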

Efficient Large Language Models: A Survey

Efficient Large Foundation Model Inference: A Perspective From Model and System Co-Design

The Efficiency Spectrum of Large Language Models: An Algorithmic Survey

Model Compression and Efficient Inference for Large Language Models: A Survey

LLM Inference Unveiled: Survey and Roofline Model Insights

LLM Inference Serving: Survey of Recent Advances and Opportunities

A Survey of Large Language Models

Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

Search for Efficient Large Language Models

Efficient Multimodal Large Language Models: A Survey

A Survey on Evaluation of Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

A Survey on Model Compression for Large Language Models

Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations