AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang
2024-10-14
Abstract: Recent work on pruning large language models (LLMs) has shown that one can eliminate a large number of parameters without compromising performance, making pruning a promising strategy to reduce LLM model size. Existing LLM pruning strategies typically assign uniform pruning ratios across layers, limiting overall pruning ability; and recent work on layerwise pruning of LLMs is often based on heuristics that can easily lead to suboptimal performance. In this paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory, in particular the shape of empirical spectral densities (ESDs) of weight matrices, to design improved layerwise pruning ratios for LLMs. Our analysis reveals a wide variability in how well-trained, and thus relatedly how prunable, different layers of an LLM are. Based on this, we propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner. AlphaPruning can be used in conjunction with multiple existing LLM pruning methods. Our empirical results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs. We have open-sourced our code at https://github.com/haiquanlu/AlphaPruning.
Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models" addresses three limitations of existing pruning methods for large language models (LLMs):

1. **Limitations of uniform pruning ratios**:
   - Existing LLM pruning strategies usually assign the same pruning ratio to every layer, which limits overall pruning ability. Uniform pruning makes very high sparsity hard to reach, which caps how far the parameter count can be reduced.
2. **Deficiencies in heuristic-based layer-wise pruning methods**:
   - Some recent works allocate layer-wise pruning ratios heuristically, for example Outlier Weighed Layerwise sparsity (OWL), which is based on outlier activations. These methods rely on outliers being present in the model. If the model has no pronounced outliers, they may lead to suboptimal performance.
3. **Lack of theoretically guided pruning ratio allocation**:
   - Few studies develop theoretically sound methods for computing layer-wise pruning ratios. Most existing methods rely on heuristic or empirical indicators and lack a solid theoretical foundation.

### Solutions

To address these problems, the paper proposes a new layer-wise pruning method, AlphaPruning. Its main contributions are as follows:

1. **Utilizing Heavy-Tailed Self-Regularization (HT-SR) theory**:
   - The training quality of each layer is quantified by analyzing the shape of the empirical spectral density (ESD) of its weight matrix. Specifically, the paper fits a power-law (PL) distribution to the ESD and uses the fitted power-law exponent (PL_Alpha_Hill) as the metric that determines each layer's pruning ratio.
2. **More principled layer-wise pruning ratio allocation**:
   - Pruning ratios are assigned per layer according to training quality, as measured by PL_Alpha_Hill. Layers that are better trained (lower PL_Alpha_Hill) receive lower pruning ratios to preserve their performance, while layers that are poorly trained (higher PL_Alpha_Hill) receive higher pruning ratios to remove more parameters.
3. **Extensive experimental verification**:
   - The paper evaluates the method on multiple LLM architectures, including LLaMA, OPT, Vicuna, and Mistral. The results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, outperforming existing pruning methods.

### Main findings

- **Shape metrics are superior to scale metrics**:
  - Through a systematic evaluation of multiple weight-matrix metrics, the paper finds that shape metrics (such as PL_Alpha_Hill) are better than scale metrics (such as the Frobenius norm and the spectral norm) for allocating layer-wise sparsity.
- **Significant performance improvement**:
  - AlphaPruning improves performance on multiple LLMs, especially on zero-shot tasks. For example, at 70% sparsity, AlphaPruning reduces the perplexity of LLaMA-7B by about 99% and increases average zero-shot accuracy by 8.79%.

### Conclusion

By drawing on Heavy-Tailed Self-Regularization theory, AlphaPruning provides a theoretically grounded and effective layer-wise pruning method. It substantially reduces the number of LLM parameters while keeping perplexity reasonable, offering a new option for the efficient deployment of large language models.
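
The allocation idea summarized above (estimate a power-law exponent per layer, then prune layers with a higher exponent more aggressively) can be illustrated with a minimal sketch. It is not the authors' implementation. The Hill estimator is a standard way to estimate a power-law exponent from the largest eigenvalues of a layer's `W^T W`. The function names, the tail size `k`, the linear mapping, and the hypothetical `tau` parameter (how far a layer's ratio may deviate from the target before rescaling) are all assumptions of this sketch. The eigenvalues are assumed to be computed elsewhere, for example as squared singular values of the weight matrix.

```python
import math
from typing import List, Sequence


def hill_alpha(eigenvalues: Sequence[float], k: int) -> float:
    """Hill estimate of the power-law exponent from the k largest eigenvalues.

    The choice of k (the tail size) is an assumption of this sketch.
    """
    lam = sorted(e for e in eigenvalues if e > 0.0)  # ascending, positive only
    n = len(lam)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of positive eigenvalues")
    threshold = lam[n - k - 1]  # largest eigenvalue outside the tail
    log_sum = sum(math.log(x / threshold) for x in lam[n - k:])
    if log_sum == 0.0:
        raise ValueError("degenerate tail: all tail eigenvalues equal the threshold")
    return 1.0 + k / log_sum


def allocate_sparsity(
    alphas: Sequence[float],
    sizes: Sequence[int],
    target: float,
    tau: float,
) -> List[float]:
    """Map per-layer alphas to sparsity ratios (higher alpha -> more pruning).

    Hypothetical linear mapping: scores are scaled into [1 - tau, 1 + tau] and
    then rescaled so the parameter-weighted mean sparsity equals `target`.
    """
    lo, hi = min(alphas), max(alphas)
    span = hi - lo
    # Normalized score in [0, 1]; fall back to 0.5 if all alphas coincide.
    scores = [(a - lo) / span if span > 0 else 0.5 for a in alphas]
    raw = [(1.0 - tau) + 2.0 * tau * s for s in scores]
    # eta enforces the global sparsity budget across layers of different size.
    eta = target * sum(sizes) / sum(r * n for r, n in zip(raw, sizes))
    # Clipping keeps ratios valid; if it triggers, the mean may drift from target.
    return [min(1.0, max(0.0, eta * r)) for r in raw]


if __name__ == "__main__":
    # Toy per-layer alphas and parameter counts, chosen only for illustration.
    ratios = allocate_sparsity([2.0, 3.0, 4.0], [100, 100, 100], target=0.7, tau=0.2)
    print([round(r, 2) for r in ratios])
```

In this sketch the layer with the lowest exponent gets the smallest ratio and the layer with the highest exponent gets the largest, while the parameter-weighted average stays at the target as long as no ratio is clipped.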