One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian
2024-04-23
Abstract: Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks. However, their enormous model sizes have hindered their practical use in real-world applications due to high inference latency. Therefore, improving the efficiency of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method become even more pronounced when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the code.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the high inference latency and large GPU memory consumption that large language models (LLMs) incur in practical applications because of their size. To improve the efficiency of LLMs, existing research has explored methods such as quantization and pruning. However, these methods often require extensive retraining to restore model performance, which becomes very expensive and impractical for LLMs with billions of parameters. The paper therefore proposes a Hessian sensitivity-aware mixed sparsity pruning method that aims to prune LLMs to at least 50% sparsity without any retraining. The method allocates sparsity adaptively according to the sensitivity of each layer, reducing the error caused by pruning while maintaining the overall sparsity level. It is also compatible with quantization techniques, so LLMs can be compressed further to reach a higher compression ratio with less performance loss. Specifically, the main contributions of this paper include:
1. Introducing a more comprehensive weight selection criterion, the improved saliency criterion (ISC), which combines the advantages of both the OBS and OBD methods.
2. Proposing a sensitivity-aware mixed sparsity pruning strategy based on Hessian information.
3. Showing experimentally that the method achieves better perplexity and zero-shot downstream task performance than SparseGPT on multiple state-of-the-art LLMs.
With these contributions, the paper addresses the efficiency problem of LLMs in practical applications and also points to a direction for future research, in particular how to further improve model compression by combining the method with mixed-precision quantization.
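To make the idea of sensitivity-aware sparsity allocation concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the summary above does not state the exact allocation rule, so the log-scaled linear rule, the `spread` parameter, and the function names `allocate_sparsity` and `magnitude_prune` are all assumptions made for illustration. The per-layer sensitivity scores are hypothetical inputs (in the paper they are derived from Hessian information), and plain magnitude pruning stands in for the ISC weight selection criterion. The sketch also assumes all layers have the same number of weights, so that the mean of the per-layer sparsities equals the overall sparsity.

```python
import numpy as np


def allocate_sparsity(sensitivities, target=0.5, spread=0.1):
    """Give sensitive layers less sparsity while keeping the mean at `target`.

    Assumed rule (not from the paper): offsets are linear in the centered
    log-sensitivity and bounded by `spread`.
    """
    s = np.asarray(sensitivities, dtype=float)
    if np.any(s <= 0):
        raise ValueError("sensitivities must be positive")
    if target - spread < 0 or target + spread > 1:
        raise ValueError("target +/- spread must stay within [0, 1]")
    z = np.log(s)
    z = z - z.mean()              # zero-mean, so the offsets sum to zero
    scale = np.abs(z).max()
    if scale == 0:                # all layers equally sensitive
        return np.full_like(s, target)
    return target - spread * z / scale


def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude entries (a stand-in for the ISC criterion)."""
    flat = weights.astype(float).ravel()   # astype returns a copy
    k = int(round(sparsity * flat.size))   # number of weights to remove
    if k > 0:
        idx = np.argpartition(np.abs(flat), k - 1)[:k]
        flat[idx] = 0.0
    return flat.reshape(weights.shape)


# Hypothetical sensitivities for three equally sized layers.
layer_sparsity = allocate_sparsity([1.0, 10.0, 100.0], target=0.5, spread=0.1)
rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)) for _ in layer_sparsity]
pruned = [magnitude_prune(w, sp) for w, sp in zip(layers, layer_sparsity)]
```

Because the offsets are zero-mean, the least sensitive layer is pruned more aggressively and the most sensitive layer less, while the average sparsity stays at the target. A real implementation would weight the average by layer size and replace the magnitude criterion with a Hessian-based saliency score.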