Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Georgy Tyukin
2024-04-03
Abstract: Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the performance of larger models, but with a reduced cost of running them. In this thesis we explore the methods of model compression, and we empirically demonstrate that the simple method of skipping later attention sublayers in Transformer LLMs is an effective method of model compression, as these layers prove to be redundant, whilst also being incredibly computationally expensive. We observed a 21% speed increase in one-token generation for Llama 2 7B, whilst unexpectedly improving performance over several common benchmarks.
Machine Learning, Artificial Intelligence, Computation and Language, Performance
What problem does this paper attempt to address?
The paper addresses the rising inference cost of large language models (LLMs). As parameter counts grow, larger models train quicker, but running them becomes significantly more expensive. The author therefore studies model compression, which aims to retain the performance of a large model while reducing the cost of running it. The main empirical finding is that simply skipping the later attention sublayers of a Transformer LLM is an effective form of compression, because these sublayers prove to be redundant while also being computationally expensive. On Llama 2 7B, this gives a 21% speed increase in one-token generation and, unexpectedly, improves performance on several common benchmarks.
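To make the idea of skipping later attention sublayers concrete, here is a minimal sketch in plain Python. It is not the thesis's implementation: it assumes each Transformer layer applies an attention sublayer and then a feed-forward sublayer, each added to a residual stream, and that "skipping" means leaving the residual stream unchanged where attention would have run. Normalization is omitted, and the names `forward`, `run_demo`, `toy_attn`, `toy_ffn`, and `skip_last_k` are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def residual_add(hidden: Vector, update: Vector) -> Vector:
    """Add a sublayer's output back onto the residual stream."""
    return [h + u for h, u in zip(hidden, update)]


def forward(
    hidden: Vector,
    attn_sublayers: List[Sublayer],
    ffn_sublayers: List[Sublayer],
    skip_last_k: int,
) -> Vector:
    """Run a stack of layers, skipping attention in the last `skip_last_k` layers."""
    num_layers = len(attn_sublayers)
    if len(ffn_sublayers) != num_layers:
        raise ValueError("need one feed-forward sublayer per attention sublayer")
    if not 0 <= skip_last_k <= num_layers:
        raise ValueError("skip_last_k must be between 0 and the number of layers")

    first_skipped = num_layers - skip_last_k
    for index in range(num_layers):
        # Assumption: a skipped attention sublayer contributes nothing,
        # so the residual stream passes through unchanged.
        if index < first_skipped:
            hidden = residual_add(hidden, attn_sublayers[index](hidden))
        # The feed-forward sublayer always runs.
        hidden = residual_add(hidden, ffn_sublayers[index](hidden))
    return hidden


def run_demo(num_layers: int, skip_last_k: int) -> Tuple[Vector, int, int]:
    """Run toy sublayers and report how often each kind was called."""
    calls = {"attn": 0, "ffn": 0}

    def toy_attn(h: Vector) -> Vector:
        # Stand-in for attention: mixes information across positions.
        calls["attn"] += 1
        mean = sum(h) / len(h)
        return [0.1 * mean for _ in h]

    def toy_ffn(h: Vector) -> Vector:
        # Stand-in for the feed-forward sublayer: elementwise nonlinearity.
        calls["ffn"] += 1
        return [0.1 * max(x, 0.0) for x in h]

    out = forward(
        [1.0, -2.0, 3.0],
        [toy_attn] * num_layers,
        [toy_ffn] * num_layers,
        skip_last_k,
    )
    return out, calls["attn"], calls["ffn"]


if __name__ == "__main__":
    for k in (0, 2):
        _, attn_calls, ffn_calls = run_demo(num_layers=4, skip_last_k=k)
        print(f"skip_last_k={k}: attention calls={attn_calls}, "
              f"feed-forward calls={ffn_calls}")
```

The sketch only shows where the saving comes from: with `skip_last_k=2` on a four-layer stack, attention runs twice instead of four times while every feed-forward sublayer still runs. It says nothing about accuracy, which the thesis evaluates empirically on Llama 2 7B.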