Abstract: The increasing size and complexity of Large Language Models (LLMs) pose challenges for their deployment on personal computers and mobile devices. Aggressive post-training model compression is necessary to reduce the models' size, but it often results in significant accuracy loss. To address this challenge, we propose a novel network pruning technology that utilizes over 0.7 sparsity and less than 8 bits of quantization. Our approach enables the compression of prevailing LLMs within a couple of hours while maintaining a relatively small accuracy loss. In experimental evaluations, our method demonstrates effectiveness and potential for practical deployment. By making LLMs available on domestic devices, our work can facilitate a new era of natural language processing applications with wide-ranging impacts.
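To give a rough sense of what "over 0.7 sparsity and less than 8 bits" means for storage, the arithmetic can be sketched as below. This is a back-of-the-envelope lower bound, not a figure from the paper: the helper name `compressed_size_gb`, the 66-billion-parameter count, and the 4-bit width are illustrative assumptions, and the estimate ignores the sparse-index and quantization-scale overhead that any real storage format adds.

```python
def compressed_size_gb(n_params, sparsity, bits_per_weight):
    """Lower bound on weight storage in GB (1 GB = 1e9 bytes).

    Assumes pruned weights cost nothing to store and ignores the
    index/mask and scale metadata a real sparse, quantized format needs.
    """
    kept = n_params * (1.0 - sparsity)        # weights that survive pruning
    return kept * bits_per_weight / 8 / 1e9   # bits -> bytes -> GB


# Hypothetical 66-billion-parameter model.
dense_fp16 = compressed_size_gb(66e9, sparsity=0.0, bits_per_weight=16)
sparse_int4 = compressed_size_gb(66e9, sparsity=0.7, bits_per_weight=4)
print(f"dense FP16: {dense_fp16:.1f} GB, 0.7 sparsity + 4 bits: {sparse_int4:.1f} GB")
```

Under these assumptions the weights shrink from about 132 GB to about 9.9 GB; the actual footprint would be larger once metadata is included.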

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper aims to address the challenges faced when deploying large language models (LLMs) on personal computers and mobile devices. As the scale and complexity of LLMs keep increasing, their deployment on these devices is becoming more and more difficult. To reduce the model size, aggressive post-training model compression is required, but this usually leads to a significant loss of accuracy. Therefore, the paper proposes a new network pruning technique, using a sparsity of over 0.7 and quantization of less than 8 bits, to compress existing LLMs within a few hours while maintaining a relatively small loss of accuracy.

### Specific problem descriptions

1. **Model size and complexity**: The number of parameters in LLMs reaches tens of billions, which makes them only runnable on limited platforms, and aggressive parameter compression is required for deployment on low-end devices.
2. **Limitations of existing methods**: Although existing neural network compression methods perform well on small models, they require thousands of GPU hours when applied to LLMs, making them impractical.
3. **Trade-off between sparsity and accuracy**: Increasing sparsity can further reduce the model size, but an increase in sparsity will lead to an exponential growth in perplexity, thus affecting the model performance.
4. **Time complexity**: Existing pruning methods assume that all weights are pruned in order when calculating the inverse of the Hessian matrix, which will lead to a significant increase in time complexity in large-scale LLMs.

### Solutions

The paper proposes a layer-based sparsity scheduler, which solves the above problems through the following steps:

1. **Layer sparsity scheduler**: Use the inverse of the Hessian matrix to estimate the weight update terms and select the optimal sparsity level for each layer.
2. **Log-level clustering**: Effectively control the sparsity distribution and perplexity by performing log-level clustering on the estimation error.
3. **Explain the validity of the sequential pruning assumption**: Provide a formal explanation of the validity of the sequential pruning assumption when pre-calculating the inverse of the Hessian matrix.
4. **Performance at high sparsity**: For the first time, achieve high sparsity (> 0.7) for LLMs with perplexity close to that of the dense model.
5. **Compatibility with quantization techniques**: Be compatible with the quantization technique of converting FP16 weights to INT4 to further compress LLMs.

### Experimental results

The paper conducted experiments on models such as OPT-66B and BLOOM-176B. The results show that this method outperforms existing state-of-the-art methods (such as SparseGPT) in terms of perplexity and performs better in most cases. In addition, this method is also applicable to smaller LLMs, and also achieves good results when using a narrower sparsity range on the OPT-6.7B model.

### Conclusions

The paper proposes a score-based layer sparsity scheduler, which can achieve high-sparsity LLM compression while maintaining model performance. This method performs well in experiments and provides an effective solution for the deployment of LLMs on personal devices. Future work can explore quantitative indicators for determining the optimal sparsity range and study the relationship between sparsity and other factors (such as speed and memory usage).
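The layer sparsity scheduler and log-level clustering steps listed under Solutions can be illustrated with a minimal sketch. The summary does not give the paper's exact scoring or clustering rule, so everything here is an assumption: `obs_layer_score` uses the classic Optimal Brain Surgeon saliency, w² / (2·[H⁻¹]ₖₖ), averaged over a layer as a stand-in for the estimation error, and `schedule_sparsity` groups layers into equal-width bins of log10(score), giving low-error layers the highest sparsity. The function names, the sparsity range 0.6 to 0.8, and the five levels are all hypothetical.

```python
import numpy as np


def obs_layer_score(weights, hessian_inv_diag):
    """Mean OBS-style saliency w^2 / (2 * [H^-1]_kk) over one layer.

    weights: array of shape (rows, cols); hessian_inv_diag: shape (cols,),
    the diagonal of the inverse Hessian (assumed to be precomputed).
    """
    return float(np.mean(weights ** 2 / (2.0 * hessian_inv_diag)))


def schedule_sparsity(scores, s_min=0.6, s_max=0.8, n_levels=5):
    """Assign each layer a sparsity level from equal-width bins in log10 space."""
    log_scores = np.log10(np.asarray(scores, dtype=float))
    lo, hi = log_scores.min(), log_scores.max()
    if hi == lo:
        # All layers look equally sensitive: use the middle of the range.
        return np.full(len(log_scores), (s_min + s_max) / 2.0)
    # Bin 0 holds the smallest errors, bin n_levels - 1 the largest.
    bins = ((log_scores - lo) / (hi - lo) * n_levels).astype(int)
    bins = np.minimum(bins, n_levels - 1)
    # Small estimated error -> prune more; large error -> prune less.
    levels = np.linspace(s_max, s_min, n_levels)
    return levels[bins]


# Hypothetical per-layer scores spanning four orders of magnitude.
layer_scores = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
sparsities = schedule_sparsity(layer_scores)
print(sparsities, "mean:", sparsities.mean())
```

Binning in log space rather than linear space keeps a few very sensitive layers from collapsing all the others into a single cluster, which is one plausible reading of why the clustering is described as log-level.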

Aggressive Post-Training Compression on Extremely Large Language Models

A Survey on Model Compression for Large Language Models

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

Compressing Large Language Models by Joint Sparsification and Quantization

On the Compressibility of Quantized Large Language Models

Search for Efficient Large Language Models

Data-free Weight Compress and Denoise for Large Language Models

Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models

Activation Sparsity Opportunities for Compressing General Large Language Models

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning

Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

LLM-Pruner: On the Structural Pruning of Large Language Models

Foundations of Large Language Model Compression -- Part 1: Weight Quantization

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Pruning Large Language Models via Accuracy Predictor