Abstract: Network Pruning is a promising way to address the huge computing resource demands of the deployment and inference of Large Language Models (LLMs). Retraining-free is important for LLMs' pruning methods. However, almost all of the existing retraining-free pruning approaches for LLMs focus on unstructured pruning, which requires specific hardware support for acceleration. In this paper, we propose a novel retraining-free structured pruning framework for LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is hardware-friendly by effectively reducing storage and enhancing inference speed. For effective structured pruning of LLMs, we highlight three critical elements that demand the utmost attention: formulating structured importance metrics, adaptively searching the global compressed model, and implementing compensation mechanisms to mitigate performance loss. First, FLAP determines whether the output feature map is easily recoverable when a column of weight is removed, based on the fluctuation pruning metric. Then it standardizes the importance scores to adaptively determine the global compressed model structure. At last, FLAP adds additional bias terms to recover the output feature maps using the baseline values. We thoroughly evaluate our approach on a variety of language benchmarks. Without any retraining, our method significantly outperforms the state-of-the-art methods, including LLM-Pruner and the extension of Wanda in structured pruning. The code is released at https://github.com/CASIA-IVA-Lab/FLAP.
What problem does this paper attempt to address?
The paper aims to apply structured pruning to large language models (LLMs) without retraining, reducing their computational resource requirements while preserving model performance. Specifically, it proposes a method named FLAP (FLuctuation-based Adaptive Structured Pruning), built around three key elements:
1. **Structured Importance Measurement**: Design a metric that captures the importance of an entire row or column of weights, so that redundant structures can be identified and removed.
2. **Adaptive Search for Global Compression Model Structure**: Develop a mechanism that adaptively searches for the global structure of the compressed model.
3. **Compensation Strategy**: Introduce a compensation mechanism to minimize the performance degradation caused by pruning.
The paper points out that existing unstructured pruning methods reduce the number of parameters but need specific hardware support to accelerate inference. Structured pruning reduces both the parameter count and the inference time without relying on specific hardware, which makes it better suited to large-scale deployment. However, existing structured pruning methods usually rely on retraining, which is hard to afford when computational resources are limited. FLAP therefore proposes a retraining-free structured pruning framework and restores the pruned model's performance through a baseline-bias compensation mechanism.
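The difference between the two pruning styles can be seen on a toy weight matrix. The sketch below is illustrative only: the function names, the magnitude threshold, and the choice of columns are assumptions made for this example and are not taken from the paper.

```python
def unstructured_prune(weight, threshold):
    """Zero individual small-magnitude weights; the matrix shape is unchanged."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weight]


def structured_prune(weight, keep_columns):
    """Drop whole input columns; the matrix physically shrinks."""
    return [[row[j] for j in keep_columns] for row in weight]


# Toy 2x4 weight matrix (rows = output channels, columns = input channels).
weight = [
    [0.9, 0.01, -0.7, 0.02],
    [-0.8, 0.03, 0.6, -0.01],
]

sparse = unstructured_prune(weight, threshold=0.1)
small = structured_prune(weight, keep_columns=[0, 2])

print(len(sparse[0]))  # 4: zeros are still stored and multiplied by dense kernels
print(len(small[0]))   # 2: an ordinary dense matmul now does half the work
```

The unstructured result only pays off when the hardware or kernel can skip zeros, whereas the structured result is a smaller dense matrix that any standard matrix multiplication handles faster.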
### Main Contributions
- **Proposed FLAP**: A retraining-free structured pruning framework which, according to the paper, is the first to identify structured sample stability characteristics in LLMs.
- **Baseline-Bias Compensation Mechanism**: A method that restores the pruned model's performance without retraining, and is especially useful at high pruning ratios.
- **Significant Performance Improvement**: On multiple language benchmarks, FLAP significantly outperforms existing methods without any retraining, including LLM-Pruner and the structured-pruning extension of Wanda.
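The idea behind baseline-bias compensation can be sketched on a toy linear layer. This is a minimal illustration, assuming the "baseline value" of a pruned input channel is its mean over the calibration samples. The helper names (`matvec`, `channel_means`, `prune_with_bias`) are hypothetical and do not come from the FLAP code base.

```python
def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def channel_means(samples):
    """Per-channel mean of the calibration inputs (the assumed 'baseline')."""
    n = len(samples)
    return [sum(s[j] for s in samples) / n for j in range(len(samples[0]))]


def prune_with_bias(weight, bias, samples, pruned):
    """Drop the input channels in `pruned` and fold their average
    contribution, W[:, j] * mean_j, into the bias."""
    means = channel_means(samples)
    keep = [j for j in range(len(weight[0])) if j not in pruned]
    new_weight = [[row[j] for j in keep] for row in weight]
    new_bias = [
        b + sum(row[j] * means[j] for j in pruned)
        for row, b in zip(weight, bias)
    ]
    return new_weight, new_bias, keep


# Toy layer with 3 input channels; channel 1 is constant across samples.
weight = [[1.0, 0.5, -1.0], [2.0, -1.5, 0.5]]
bias = [0.0, 0.0]
samples = [[1.0, 2.0, 3.0], [0.0, 2.0, -1.0], [4.0, 2.0, 0.5]]

new_weight, new_bias, keep = prune_with_bias(weight, bias, samples, pruned=[1])

x = samples[0]
original = [y + b for y, b in zip(matvec(weight, x), bias)]
compensated = [
    y + b for y, b in zip(matvec(new_weight, [x[j] for j in keep]), new_bias)
]
print(original, compensated)
```

Because channel 1 never changes in this toy data, replacing it with its mean loses nothing and the two outputs agree. With real activations the match is only approximate, and the error grows with how much the pruned channel fluctuates around its mean.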
### Method Overview
The methodology of FLAP mainly includes three core components:
1. **Baseline-Bias Compensation**: Add a bias term that compensates for the change in the output feature map caused by pruning.
2. **Structured Fluctuation Measurement**: Compute the fluctuation of each channel over a calibration dataset to assess how easily the output feature map can be recovered after that channel is pruned.
3. **Adaptive Structure Search**: Adaptively determine the global structure of the compressed model by standardizing the fluctuation scores across layers and modules.
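The fluctuation measurement and the adaptive search can be sketched together. The summary above does not give the formulas, so this sketch assumes that the fluctuation of input channel j is its sample variance over the calibration data, weighted by the squared norm of the matching weight column, and that scores are standardized per module (zero mean, unit standard deviation) before one global ranking. All function names and the toy scores are hypothetical.

```python
from statistics import mean, pstdev, variance


def fluctuation_scores(samples, weight):
    """Assumed metric: Var(X[:, j]) * ||W[:, j]||^2 for each input channel j."""
    scores = []
    for j in range(len(weight[0])):
        var_j = variance([s[j] for s in samples])  # sample variance (N - 1)
        col_norm_sq = sum(row[j] ** 2 for row in weight)
        scores.append(var_j * col_norm_sq)
    return scores


def standardize(scores):
    """Rescale one module's scores to zero mean and unit standard deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma if sigma > 0 else 0.0 for s in scores]


def global_prune_plan(module_scores, prune_ratio):
    """Rank standardized scores across all modules and prune the lowest ones.

    module_scores: dict mapping a module name to its raw channel scores.
    Returns a dict mapping each module name to the channel indices to prune.
    """
    ranked = sorted(
        (z, name, j)
        for name, scores in module_scores.items()
        for j, z in enumerate(standardize(scores))
    )
    n_prune = int(len(ranked) * prune_ratio)
    plan = {name: [] for name in module_scores}
    for _, name, j in ranked[:n_prune]:
        plan[name].append(j)
    return plan


# A constant channel scores zero even when its weights are large.
demo = fluctuation_scores([[1.0, 2.0], [3.0, 2.0]], [[1.0, 10.0], [2.0, 0.0]])
print(demo)

# Two modules whose raw scores live on very different scales.
module_scores = {
    "attn": [0.1, 0.2, 5.0, 0.15],
    "mlp": [100.0, 900.0, 1000.0, 950.0],
}
plan = global_prune_plan(module_scores, prune_ratio=0.5)
print(plan)
```

Without standardization, a global ranking of the raw scores would remove every `attn` channel before touching `mlp`, simply because the scales differ. After standardization each module is judged relative to its own distribution, so the modules end up with different, data-dependent pruning ratios rather than one uniform ratio.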
### Experimental Results
The experiments use the LLaMA model family. The results show that FLAP significantly outperforms other methods across pruning ratios, and that its advantage grows as the pruning ratio increases. FLAP also performs better than other methods on zero-shot tasks, which further supports its effectiveness at preserving the model's generalization ability.
### Conclusion
By introducing baseline-bias compensation and a structured fluctuation metric, FLAP achieves structured pruning without retraining. It significantly reduces the computational resource requirements of large language models while maintaining model performance, and offers a new option for deploying them efficiently.