Abstract: Transformer-based large language models have achieved tremendous success. However, the significant memory and computational costs incurred during inference make it challenging to deploy large models on resource-constrained devices. In this paper, we investigate compression and efficient inference methods for large language models from an algorithmic perspective. Regarding taxonomy, as with smaller models, compression and acceleration algorithms for large language models can still be categorized into quantization, pruning, distillation, compact architecture design, and dynamic networks. However, large language models have two prominent characteristics compared to smaller models: (1) Most compression algorithms require finetuning or even retraining the model after compression, and the most notable aspect of large models is the very high cost of finetuning or training. Therefore, many algorithms for large models, such as quantization and pruning, have started to explore tuning-free approaches. (2) Large models emphasize versatility and generalization rather than performance on a single task. Hence, many algorithms, such as knowledge distillation, focus on how to preserve their versatility and generalization after compression. Since these two characteristics were not very pronounced in early large models, we further distinguish large language models into medium models and "real" large models. Additionally, we provide an introduction to some mature frameworks for efficient inference of large models, which support basic compression or acceleration algorithms and greatly facilitate model deployment for users.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper examines the excessive memory and computational cost that large language models (LLMs) incur during inference, and surveys compression and efficient-inference methods that address it. Specifically:

1. **Excessive memory and computational cost**:
   - Because of their large parameter counts, large language models need a large amount of memory and compute during inference. For example, the float32 weights of a model with 10 billion parameters alone occupy about 40 GB (roughly 37 GiB), and the memory required during inference grows further as the sequence length increases.
   - This makes it very difficult or even impossible to deploy these models on resource-constrained devices.
2. **The need for model compression and efficient inference**:
   - To deploy these models on resource-constrained devices (such as mobile devices), model compression methods are needed to reduce the memory and computational cost of inference.
   - Common model compression methods include quantization, pruning, knowledge distillation, compact architecture design, and dynamic networks.
3. **Special challenges of large models**:
   - **High cost of fine-tuning**: Most compression algorithms require fine-tuning or retraining the model after compression, but fine-tuning or training a large model is very expensive. Researchers are therefore exploring methods that need no fine-tuning or that make fine-tuning more efficient.
   - **Generality and generalization ability**: Unlike small models that handle a single task, large language models emphasize generality and generalization across tasks and unseen data. The generality and generalization ability of a compressed large model therefore need to be carefully verified.
4. **Classification and framework**:
   - The paper divides large language models into medium-sized models (fewer than 1 billion parameters) and "true" large models (more than 1 billion parameters). Medium-sized models are relatively easy to fine-tune and show fewer emergent capabilities, so many compression methods for them resemble those for small models.
   - The paper also introduces mature frameworks that support basic compression or acceleration algorithms, which can greatly simplify model deployment for users.

By covering these methods and frameworks, the paper aims to provide a comprehensive review that helps researchers and engineers better understand and apply compression and efficient-inference techniques for large language models.
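
The memory figures in point 1 can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the helper names `weight_memory_gib` and `kv_cache_gib`, the example model shape (40 layers, hidden size 5120), and the assumption of a standard multi-head-attention key/value cache are all hypothetical choices made for the example. It shows how weight memory scales with numeric precision and why inference memory grows with sequence length.

```python
# Back-of-the-envelope inference memory estimates.
# Helper names and the example model shape are hypothetical, chosen for illustration.

BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 1024 ** 3


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


def kv_cache_gib(num_layers: int, hidden_size: int, seq_len: int,
                 batch_size: int = 1, dtype: str = "float16") -> float:
    """Key/value cache size in GiB, assuming standard multi-head attention.

    Each layer stores two tensors (K and V) of shape
    batch_size x seq_len x hidden_size.
    """
    elements = 2 * num_layers * batch_size * seq_len * hidden_size
    return elements * BYTES_PER_PARAM[dtype] / GIB


if __name__ == "__main__":
    num_params = 10_000_000_000  # the 10-billion-parameter example from the text
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(num_params, dtype):6.2f} GiB of weights")

    # The cache grows linearly with sequence length.
    for seq_len in (2048, 4096, 8192):
        size = kv_cache_gib(num_layers=40, hidden_size=5120, seq_len=seq_len)
        print(f"seq_len={seq_len}: {size:.2f} GiB of KV cache")
```

Under these assumptions, lower-precision weights shrink the footprint proportionally, which is the basic motivation for quantization, while the cache term depends on sequence length rather than on weight precision.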

Model Compression and Efficient Inference for Large Language Models: A Survey

A Survey on Model Compression for Large Language Models

Joint Structured Pruning and Dense Knowledge Distillation for Efficient Transformer Model Compression

A Survey on Efficient Inference for Large Language Models

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks

A Comprehensive Survey of Compression Algorithms for Language Models

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

The Efficiency Spectrum of Large Language Models: An Algorithmic Survey

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

A Survey on Transformer Compression

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Evaluating Large Language Models for Generalization and Robustness via Data Compression

Search for Efficient Large Language Models

A Survey on Model Compression and Acceleration for Pretrained Language Models

Language Modeling Is Compression

Efficient Large Language Models: A Survey

Efficient Large Foundation Model Inference: A Perspective From Model and System Co-Design

A Comprehensive Study on Quantization Techniques for Large Language Models

Aggressive Post-Training Compression on Extremely Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading