Efficient Large Language Models: A Survey

Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang
2024-05-23
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding and language generation, and thus have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency challenges. In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspectives, respectively. We have also created a GitHub repository where we organize the papers featured in this survey at
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper is a survey of research on the efficiency of large language models (LLMs). LLMs such as OpenAI's GPT series, Meta's LLaMA series, and Google's Gemini perform strongly on tasks such as natural language understanding and generation, and have had a significant impact on society. These capabilities, however, come with heavy resource demands, including increased GPU hours during training and inference, which result in high running costs. The paper systematically reviews and organizes technical research on improving LLM efficiency. The authors group the relevant literature into three main categories: model-centric, data-centric, and framework-centric. These cover efficiency methods such as compression, pre-training, fine-tuning, inference acceleration, and architecture design. The paper also discusses the role of data quality and structure in improving LLM efficiency, as well as dedicated frameworks for LLM training, fine-tuning, inference, and serving. It includes a graph relating LLM performance, training time, and inference throughput, which emphasizes the trade-off between model size and resource consumption and points to the Mistral-7B model as an example of higher efficiency achieved through optimization techniques. The authors have also set up a GitHub repository, kept up to date, that collects the relevant research papers for researchers and practitioners to reference. In summary, the paper addresses how to reduce the resource requirements of LLMs without sacrificing performance, achieving more efficient and economical operation through algorithmic improvements, data selection, and framework enhancements.