Abstract: Recent work on pruning large language models (LLMs) has shown that one can eliminate a large number of parameters without compromising performance, making pruning a promising strategy to reduce LLM model size. Existing LLM pruning strategies typically assign uniform pruning ratios across layers, limiting overall pruning ability; and recent work on layerwise pruning of LLMs is often based on heuristics that can easily lead to suboptimal performance. In this paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory, in particular the shape of empirical spectral densities (ESDs) of weight matrices, to design improved layerwise pruning ratios for LLMs. Our analysis reveals a wide variability in how well-trained, and thus relatedly how prunable, different layers of an LLM are. Based on this, we propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner. AlphaPruning can be used in conjunction with multiple existing LLM pruning methods. Our empirical results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs. We have open-sourced our code at https://github.com/haiquanlu/AlphaPruning.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models" targets three limitations of existing pruning methods for large language models (LLMs):

1. **Limitations of uniform pruning ratios**:
   - Existing LLM pruning strategies usually assign the same pruning ratio to every layer, which limits overall pruning ability. With a uniform ratio it is hard to reach very high sparsity, so the parameter count cannot be reduced as far as it otherwise could be.
2. **Deficiencies in heuristic-based layerwise pruning methods**:
   - Some recent works allocate layerwise pruning ratios heuristically (for example Outlier Weighed Layerwise sparsity, OWL, which is based on outlier activations). These methods rely on the existence of outliers in the model; if a model has no pronounced outliers, they may lead to suboptimal performance.
3. **Lack of theoretically guided pruning ratio allocation**:
   - Few studies develop theoretically sound methods for computing layerwise pruning ratios. Most existing methods are based on heuristic or empirical indicators and lack a solid theoretical foundation.

### Solutions

To address these problems, the paper proposes a new layerwise pruning method, AlphaPruning. Its main contributions are:

1. **Utilizing Heavy-Tailed Self-Regularization (HT-SR) theory**:
   - The training quality of each layer is quantified by analyzing the shape of the empirical spectral density (ESD) of its weight matrices. Specifically, the paper fits a power-law (PL) distribution to the ESD and uses the fitted power-law exponent (PL_Alpha_Hill) as the metric for setting layerwise pruning ratios.
2. **More principled layerwise pruning ratio allocation**:
   - Pruning ratios are allocated according to the training quality of each layer (i.e., its PL_Alpha_Hill value). Better-trained layers (lower PL_Alpha_Hill) receive lower pruning ratios to preserve their performance, while poorly trained layers (higher PL_Alpha_Hill) receive higher pruning ratios to remove more parameters.
3. **Extensive experimental verification**:
   - The paper evaluates the method on multiple LLM architectures, including LLaMA, OPT, Vicuna, and Mistral. The results show that AlphaPruning can reach a sparsity of up to 80% on LLaMA-7B while maintaining reasonable perplexity, outperforming existing pruning methods.

### Main findings

- **Shape indicators are superior to scale indicators**:
  - Through a systematic evaluation of multiple weight-matrix indicators, the paper finds that shape indicators (such as PL_Alpha_Hill) are superior to scale indicators (such as the Frobenius norm and spectral norm) for allocating layerwise sparsity.
- **Significant performance improvement**:
  - AlphaPruning improves performance on multiple LLMs, especially on zero-shot tasks. For example, at 70% sparsity, AlphaPruning reduces the perplexity of the LLaMA-7B model by about 99% and increases the average accuracy of zero-shot tasks by 8.79%.

### Conclusion

By building on heavy-tailed self-regularization theory, AlphaPruning provides a theoretically grounded and effective layerwise pruning method. It can substantially reduce the number of LLM parameters while maintaining reasonable perplexity, offering a new option for the efficient deployment of large language models.
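
The two steps described under Solutions (estimating PL_Alpha_Hill from a layer's ESD, then turning the per-layer values into pruning ratios) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `tail_fraction` and `spread` parameters, and the linear mapping that keeps the size-weighted average sparsity at the target are all assumptions made for illustration. The only parts taken from the summary are that the ESD is fitted with a power law via a Hill-type estimator and that higher alpha values receive higher pruning ratios.

```python
import numpy as np


def hill_alpha(weight, tail_fraction=0.5):
    """Hill estimate of the power-law exponent of a weight matrix's ESD."""
    # The ESD is the spectrum of W^T W, i.e. the squared singular values of W.
    eigs = np.sort(np.linalg.svd(weight, compute_uv=False) ** 2)
    n = eigs.size
    if n < 2:
        raise ValueError("need at least two eigenvalues")
    # Assumed tail size: the largest `tail_fraction` of the eigenvalues.
    k = max(1, min(n - 1, int(n * tail_fraction)))
    tail = eigs[n - k:]          # the k largest eigenvalues
    threshold = eigs[n - k - 1]  # the eigenvalue just below the tail
    return 1.0 + k / np.sum(np.log(tail / threshold))


def allocate_sparsity(alphas, sizes, target, spread=0.2):
    """Map per-layer alphas to sparsity ratios (illustrative linear rule).

    Higher alpha -> higher sparsity. The size-weighted mean of the
    returned ratios equals `target`; `spread` bounds the deviation.
    """
    alphas = np.asarray(alphas, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    span = alphas.max() - alphas.min()
    if span == 0:
        return np.full_like(alphas, target)
    scaled = (alphas - alphas.min()) / span  # rescale to [0, 1]
    centered = scaled - np.average(scaled, weights=sizes)
    ratios = target + spread * centered
    if ratios.min() < 0.0 or ratios.max() > 1.0:
        raise ValueError("spread too large for this target")
    return ratios


# Usage on random stand-in "layers" (real use would load model weights).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)) for _ in range(3)]
alphas = [hill_alpha(w) for w in layers]
sizes = [w.size for w in layers]
ratios = allocate_sparsity(alphas, sizes, target=0.7)
print(dict(zip(np.round(alphas, 3), np.round(ratios, 3))))
```

The resulting per-layer ratios would then be passed to whichever base pruning method is in use, since the abstract notes that AlphaPruning is meant to be combined with existing LLM pruning methods. Centering the rescaled alphas on their size-weighted mean is one simple way to keep the overall sparsity fixed; other normalizations are possible.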

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models
