Abstract: Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks. However, their enormous model sizes have hindered practical use in real-world applications due to high inference latency. Therefore, improving the efficiency of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method become even more pronounced when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the code.

What problem does this paper attempt to address?

This paper addresses the high inference latency and large GPU memory consumption that large language models (LLMs) incur in practical applications because of their size. To improve the efficiency of LLMs, existing research has explored methods such as quantization and pruning. However, these methods often require a large amount of retraining to restore model performance, which becomes very expensive and impractical for LLMs with billions of parameters. For this reason, the paper proposes a Hessian sensitivity-aware mixed sparsity pruning method that aims to prune LLMs to at least 50% sparsity without any retraining. The method allocates sparsity adaptively according to the sensitivity of each layer, reducing the error caused by pruning while maintaining the overall sparsity level. In addition, the method is compatible with quantization techniques, so LLMs can be compressed further to reach a higher compression ratio with lower performance loss.

Specifically, the main contributions of this paper include:

1. Introducing a more comprehensive weight selection criterion, the improved saliency criterion (ISC), which combines the advantages of both the OBS and OBD methods.
2. Proposing a sensitivity-aware mixed sparsity pruning strategy based on Hessian information.
3. Showing experimentally that the method achieves better perplexity and zero-shot downstream task performance than SparseGPT on multiple state-of-the-art LLMs.

In this way, the paper addresses the efficiency problem of LLMs in practical applications and also points to a direction for future research, especially on how to further improve model compression by combining pruning with mixed-precision quantization techniques.
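The idea of allocating sparsity by sensitivity while keeping the overall level fixed can be illustrated with a small sketch. This is not the paper's algorithm: the function name `allocate_sparsity`, the `spread` parameter, the linear rank-based schedule, and the assumption that all layers have the same number of weights are illustrative choices, and the sensitivity scores are made-up numbers standing in for whatever Hessian-based measure is used.

```python
def allocate_sparsity(sensitivities, target=0.5, spread=0.1):
    """Assign a sparsity level to each layer (hypothetical helper).

    Less sensitive layers get more sparsity, more sensitive layers get less.
    Offsets are symmetric around zero, so the mean sparsity equals `target`
    when all layers have the same number of weights (an assumption here).
    """
    n = len(sensitivities)
    if n == 1:
        return [target]
    # order[rank] = index of the layer with the rank-th lowest sensitivity
    order = sorted(range(n), key=lambda i: sensitivities[i])
    sparsity = [0.0] * n
    for rank, layer in enumerate(order):
        # rank 0 (least sensitive) -> target + spread
        # rank n-1 (most sensitive) -> target - spread
        offset = spread * (1 - 2 * rank / (n - 1))
        sparsity[layer] = target + offset
    return sparsity


# Made-up sensitivity scores for four layers
levels = allocate_sparsity([0.2, 3.0, 1.1, 0.7], target=0.5, spread=0.1)
for i, s in enumerate(levels):
    print(f"layer {i}: sparsity {s:.3f}")
```

In this sketch the most sensitive layer (index 1) is pruned least and the least sensitive layer (index 0) is pruned most, while the average stays at the 50% target. A real implementation would need to weight the allocation by layer size and derive the scores from Hessian information.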

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

LLM-Pruner: On the Structural Pruning of Large Language Models

Large Language Model Pruning

Adaptive Pruning for Large Language Models with Structural Importance Awareness

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Pruning Large Language Models via Accuracy Predictor

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing

A Simple and Effective Pruning Approach for Large Language Models

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations