Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Georgy Tyukin

2024-04-03

Abstract: Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Model compression is therefore important: it retains the performance of larger models at a reduced cost of running them. In this thesis we explore methods of model compression, and we empirically demonstrate that simply skipping the latter attention sublayers in Transformer LLMs is an effective form of model compression, as these layers prove to be redundant whilst also being incredibly computationally expensive. We observed a 21% speed increase in one-token generation for Llama 2 7B, whilst unexpectedly improving performance over several common benchmarks.

Machine Learning, Artificial Intelligence, Computation and Language, Performance

What problem does this paper attempt to address?

The paper addresses the inference efficiency of large language models (LLMs). As parameter counts keep growing, larger models train more quickly, but the cost of running them at inference time rises sharply. The author therefore studies model compression techniques that aim to cut running costs while retaining the performance of large models. The method the paper demonstrates is to skip the latter attention sublayers in Transformer LLMs, since these sublayers prove to be both redundant and computationally expensive. Experiments on Llama 2 7B show a 21% speed increase in one-token generation, together with an unexpected improvement on several common benchmarks. In short, the paper shows empirically that skipping specific Transformer sublayers is a simple and effective form of compression that reduces compute requirements and speeds up generation.
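The idea of skipping the latter attention sublayers can be illustrated with a small sketch. This is not the author's implementation: it is a toy decoder stack in pure Python, with no learned weights and no normalization layers, where an elementwise tanh stands in for the feed-forward sublayer. The names `forward` and `skip_last_attn` are hypothetical and chosen only for this illustration. It shows one plausible reading of the method: in the last few layers the attention sublayer is omitted and the residual stream passes straight to the feed-forward sublayer.

```python
import math


def attention_sublayer(xs):
    """Toy causal self-attention over a list of vectors (no learned weights)."""
    out = []
    for i, q in enumerate(xs):
        visible = xs[: i + 1]  # causal mask: position i sees positions 0..i
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) / total for d in range(len(q))]
        )
    return out


def mlp_sublayer(xs):
    """Toy position-wise feed-forward: elementwise tanh stands in for the MLP."""
    return [[math.tanh(v) for v in x] for x in xs]


def add(xs, ys):
    """Residual connection: elementwise sum of two sequences of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def forward(xs, num_layers, skip_last_attn=0):
    """Run the stack; the last `skip_last_attn` layers omit the attention sublayer.

    Returns the final hidden states and the number of attention calls made.
    """
    attn_calls = 0
    for layer in range(num_layers):
        if layer < num_layers - skip_last_attn:
            xs = add(xs, attention_sublayer(xs))  # residual + attention
            attn_calls += 1
        # In skipped layers the residual stream reaches the MLP unchanged.
        xs = add(xs, mlp_sublayer(xs))  # residual + feed-forward
    return xs, attn_calls


if __name__ == "__main__":
    tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
    _, full = forward(tokens, num_layers=4)
    _, pruned = forward(tokens, num_layers=4, skip_last_attn=2)
    print(full, pruned)  # 4 2
```

In this toy setting the saving is just the number of attention calls avoided. How that translates into wall-clock speed on a real model depends on sequence length, hardware, and the implementation, none of which this sketch models.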

Model Compression and Efficient Inference for Large Language Models: A Survey

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks

Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

Inference Performance Optimization for Large Language Models on CPUs

Large Language Models for Compiler Optimization

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Efficient Large Foundation Model Inference: A Perspective From Model and System Co-Design

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

A Survey on Efficient Inference for Large Language Models

Search for Efficient Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression

Large Language Model Compression with Neural Architecture Search

Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models

Large Language Model Inference Acceleration Based on Hybrid Model Branch Prediction

Transformer Layer Injection: A Novel Approach for Efficient Upscaling of Large Language Models

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference