A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang
2024-07-19
Abstract: Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Last but not least, we summarize the knowledge gained and discuss future research directions.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the difficulty of deploying Large Language Models (LLMs) in resource-constrained scenarios: although LLMs perform well across many tasks, their inference demands substantial computation and memory. It surveys the existing literature on efficient LLM inference and attributes the inefficiency to three main causes: the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach.

The paper first introduces the basic concepts of LLMs and analyzes the efficiency bottlenecks of the inference process in detail. It then proposes a taxonomy that organizes related work into three levels. Data-level optimization improves efficiency by optimizing input prompts or organizing output content. Model-level optimization designs efficient model structures or compresses pre-trained models. System-level optimization targets the inference engine or serving system; it usually requires no expensive model training and does not degrade model performance. Finally, the paper runs comparative experiments on representative methods in key subfields to provide quantitative insights, and it offers practical recommendations and a discussion of future research directions.
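
The three causes of inefficiency named above can be illustrated with a rough back-of-the-envelope sketch. This is not code from the paper: the function names, the FP16 assumption of 2 bytes per parameter, the FLOP-counting convention (one multiply-add counted as 2 FLOPs), and the toy `next_token` callable are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def weight_memory_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory needed just to hold the weights (cause 1: large model size).

    The default of 2 bytes per parameter assumes FP16 storage.
    """
    return n_params * bytes_per_param


def attention_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs of one attention layer's two big matmuls (cause 2).

    Computing Q K^T and the attention-weighted sum over V each costs about
    2 * seq_len^2 * d_model FLOPs, so the total is quadratic in seq_len.
    """
    return 4 * seq_len * seq_len * d_model


def generate(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Toy auto-regressive loop (cause 3).

    `next_token` is a hypothetical stand-in for a model forward pass.
    Each new token depends on all previous ones, so the passes cannot
    be run in parallel: one sequential pass per generated token.
    """
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
        forward_passes += 1
    return tokens, forward_passes
```

Under these assumptions, doubling the sequence length quadruples `attention_flops`, a 7-billion-parameter model needs about 14 GB for its FP16 weights alone, and generating N tokens takes N sequential forward passes. Real systems differ in detail (for example, multi-head layouts and caching change the constants), so the numbers are only indicative.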