Abstract: The rise of large language models (LLMs) has significantly advanced various natural language processing (NLP) tasks. However, the resource demands of these models pose substantial challenges. Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation, necessitating parameter updates to adapt. Unfortunately, such fine-tuning requires substantial memory, which limits its applicability. To address these challenges, we introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference. However, the combined errors from pruning and quantization increase the difficulty of fine-tuning, requiring a more refined quantization scheme. To this end, we propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise mixed-precision quantization scheme. Quantization precisions are assigned to each layer based on their importance to the target task, and Bayesian optimization is employed to refine precision allocation strategies, ensuring a balance between model accuracy and memory efficiency. Extensive experiments on benchmark datasets demonstrate that QPruner significantly outperforms existing methods in memory savings while maintaining or improving model performance.

What problem does this paper attempt to address?

This paper addresses the following problem: structured pruning can significantly reduce the size of large language models (LLMs), but it often causes a substantial drop in accuracy. Restoring the pruned model's performance usually requires fine-tuning, which in turn demands a large amount of memory and limits applicability. In addition, simply combining pruning and quantization does not balance model performance and memory efficiency well: because layers differ in importance, quantization errors accumulate and further degrade model performance.

To this end, the authors propose the QPruner framework, which tackles these problems in four ways:

1. **Combining structured pruning and quantization**: First apply structured pruning to reduce the model size, then introduce quantization to further reduce memory consumption.
2. **Layer-wise mixed-precision quantization**: Allocate a quantization precision to each layer according to its importance for the target task, so that important layers keep higher precision and overall performance is better controlled.
3. **Bayesian optimization**: Use Bayesian optimization to refine the precision allocation strategy, balancing model accuracy against memory efficiency.
4. **Efficient fine-tuning**: Adopt a parameter-efficient fine-tuning (PEFT) strategy to restore model performance.

Specifically, the goal of QPruner is to significantly reduce the memory consumption of LLMs during fine-tuning and inference while maintaining or improving model performance. In this way, QPruner aims to offer better adaptability and performance in resource-constrained scenarios.
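
The idea of giving important layers higher precision can be illustrated with a minimal sketch. It is not the authors' procedure: it replaces the Bayesian optimization step with a simple greedy rule, and the function name `allocate_precisions`, the 4-bit/8-bit choices, and the bit budget are all assumptions made for illustration.

```python
def allocate_precisions(importance, params_per_layer, budget_bits,
                        low_bits=4, high_bits=8):
    """Greedy stand-in for a precision search (hypothetical helper).

    Every layer starts at `low_bits`. Layers are then visited from most to
    least important and upgraded to `high_bits` whenever the total weight
    storage stays within `budget_bits`.
    """
    n_layers = len(importance)
    bits = [low_bits] * n_layers
    used = sum(p * low_bits for p in params_per_layer)
    if used > budget_bits:
        raise ValueError("budget too small even at the lowest precision")

    # Visit layers in order of decreasing importance score.
    order = sorted(range(n_layers), key=lambda i: importance[i], reverse=True)
    for i in order:
        extra = params_per_layer[i] * (high_bits - low_bits)
        if used + extra <= budget_bits:
            bits[i] = high_bits
            used += extra
    return bits


# Four equally sized layers; the budget allows exactly two upgrades.
example_bits = allocate_precisions(
    importance=[0.9, 0.1, 0.5, 0.3],
    params_per_layer=[100, 100, 100, 100],
    budget_bits=2400,
)
print(example_bits)  # [8, 4, 8, 4]
```

A greedy rule like this only provides a starting configuration; a search such as the Bayesian optimization described in the paper would evaluate candidate configurations against task accuracy rather than trusting the importance scores alone.
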
### Main contributions

- Proposes the QPruner framework, which integrates structured pruning and quantization to significantly reduce the memory consumption of LLMs during fine-tuning and inference.
- Introduces a layer-wise mixed-precision quantization scheme based on task importance and uses Bayesian optimization to further refine the precision configuration.
- Reports experimental results in which QPruner saves at least 30% of memory while increasing accuracy by up to 6%.

### Formula representation

Some of the formulas involved in the paper are as follows:

- Quantization process, where \( X_{\text{HP}} \) is the high-precision tensor and \( N \) is the number of bits:
\[ X_{\text{INT}} = \text{round}\left((2^N - 1)\, F(X_{\text{HP}})\right) \]
where \( F(X) = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \) is the normalization function.
- Mutual information calculation:
\[ I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \]
- Layer-wise importance estimation, measured as the change in loss on data \( D \) when the weights \( W_i \) are set to zero:
\[ I_{W_i} = \left| L_{W_i}(D) - L_{W_i = 0}(D) \right| \]
- Second-order Taylor expansion approximation of this importance, where \( H \) is the Hessian:
\[ I_{W_i} \approx \left| \frac{\partial L(D)}{\partial W_i} W_i - \frac{1}{2} W_i^\top H W_i \right| \]

These formulas help explain how QPruner evaluates and optimizes different parts of the model to achieve efficient compression and performance restoration.
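
The formulas listed above can be made concrete with a small sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, tensors are represented as flat lists, the joint distribution is passed as a nested list, and the Hessian is supplied explicitly as a small dense matrix.

```python
import math


def quantize_minmax(values, n_bits):
    """N-bit min-max quantization: round((2^N - 1) * F(x)).

    F(x) = (x - min) / (max - min). Python's round() rounds exact ties to
    the nearest even integer.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: F is undefined, so map everything to level 0.
        return [0 for _ in values]
    levels = 2 ** n_bits - 1
    return [round(levels * (v - lo) / (hi - lo)) for v in values]


def mutual_information(joint):
    """I(X;Y) in nats from a joint table joint[x][y] that sums to 1."""
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                total += p * math.log(p / (px[i] * py[j]))
    return total


def taylor_importance(grad, weights, hessian):
    """|g^T w - 0.5 * w^T H w| for one flattened weight group."""
    first = sum(g * w for g, w in zip(grad, weights))
    hw = [sum(h * w for h, w in zip(row, weights)) for row in hessian]
    second = 0.5 * sum(w * v for w, v in zip(weights, hw))
    return abs(first - second)


codes = quantize_minmax([0.0, 1.0, 2.0, 3.0], n_bits=2)
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
mi_correlated = mutual_information([[0.5, 0.0], [0.0, 0.5]])
score = taylor_importance([1.0, 2.0], [3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]])
print(codes, mi_independent, mi_correlated, score)
```

In the usage lines, the four evenly spaced values map onto the four 2-bit levels, independent variables give zero mutual information, perfectly correlated binary variables give log 2 nats, and the importance score evaluates to |11 - 12.5| = 1.5.
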

QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models

Structured Pruning of Large Language Models

BlockPruner: Fine-grained Pruning for Large Language Models

LLM-Pruner: On the Structural Pruning of Large Language Models

Structured Optimal Brain Pruning for Large Language Models

DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization

Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

Pruning as a Domain-specific LLM Extractor

Pruning Foundation Models for High Accuracy without Retraining

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

Evaluating Quantized Large Language Models

Fluctuation-based Adaptive Structured Pruning for Large Language Models

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models

Structured Pruning Learns Compact and Accurate Models

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models