Abstract: Transformer-based large language models have achieved tremendous success. However, the significant memory and computational costs incurred during inference make it challenging to deploy large models on resource-constrained devices. In this paper, we investigate compression and efficient inference methods for large language models from an algorithmic perspective. In terms of taxonomy, compression and acceleration algorithms for large language models can, as for smaller models, still be categorized into quantization, pruning, distillation, compact architecture design, and dynamic networks. However, large language models have two prominent characteristics compared to smaller models: (1) Most compression algorithms require finetuning or even retraining the model after compression, and for large models the cost of finetuning or training is very high. Therefore, many algorithms for large models, such as quantization and pruning, have started to explore tuning-free approaches. (2) Large models emphasize versatility and generalization rather than performance on a single task. Hence, many algorithms, such as knowledge distillation, focus on how to preserve versatility and generalization after compression. Since these two characteristics were not very pronounced in early large models, we further divide large language models into medium models and "real" large models. Additionally, we introduce some mature frameworks for efficient inference of large models, which support basic compression or acceleration algorithms and greatly facilitate model deployment for users.

A Survey on Model Compression and Acceleration for Pretrained Language Models

Joint Structured Pruning and Dense Knowledge Distillation for Efficient Transformer Model Compression

Model Compression and Efficient Inference for Large Language Models: A Survey

A Survey on Model Compression for Large Language Models

A Survey on Transformer Compression

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

Compression of Deep Learning Models for Text: A Survey

A Comprehensive Survey of Compression Algorithms for Language Models

Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks

A Survey on Transformers in NLP with Focus on Efficiency

Variator: Accelerating Pre-trained Models with Plug-and-Play Compression Modules

Great Power, Great Responsibility: Recommendations for Reducing Energy for Training Language Models

The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models

Model Compression for Deep Neural Networks: A Survey

Prompt Compression for Large Language Models: A Survey

COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models

A Survey of Model Compression and Acceleration for Deep Neural Networks

Exploring Extreme Parameter Compression for Pre-trained Language Models

Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment

Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models

Aggressive Post-Training Compression on Extremely Large Language Models