Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen
Abstract: As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.
What problem does this paper attempt to address?
The paper addresses the deployment cost of large language models (LLMs): as parameter counts grow, so do the hardware resources needed to serve the models, which is a major obstacle to practical use. Specifically, it focuses on layer redundancy in LLMs, that is, the observation that some layers contribute very little to the overall function of the network. The authors introduce a new metric, Block Influence (BI), to quantify the importance of each layer, and build on it a simple pruning method called ShortGPT. ShortGPT reduces model size by removing the layers that BI identifies as redundant while largely preserving performance.
### Main Contributions
1. **Analysis of Layer Redundancy in LLMs**: The research found that there is significant redundancy at the layer level in LLMs. This finding inspired a method to compress LLMs by simply removing redundant layers.
2. **Proposing Block Influence (BI) as an Indicator of Layer Importance**: Using BI to select layers, the proposed layer-removal method removes about 25% of the parameters while retaining about 90% of the performance, which is better than existing state-of-the-art pruning methods.
3. **Demonstrating the Orthogonality between Layer Pruning and Quantization Methods**: This means that the layer pruning method can be combined with quantization techniques to further reduce the deployment cost of LLMs.
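To make the orthogonality claim concrete, the following is a rough back-of-the-envelope sketch of how the two savings combine. The model configuration (32 layers, 200M parameters per layer, 300M parameters outside the layers) is a hypothetical illustration, not a figure from the paper, and the helper names `remaining_params` and `footprint_gib` are invented for this example. The estimate also ignores quantization overhead such as stored scale factors.

```python
def remaining_params(params_per_layer, num_layers, other_params, layers_removed):
    """Parameter count after deleting `layers_removed` whole layers.

    `other_params` covers everything outside the layers (embeddings, output
    head, final norm), which layer removal does not touch.
    """
    return params_per_layer * (num_layers - layers_removed) + other_params


def footprint_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB at a given precision."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical configuration, chosen only for illustration.
PER_LAYER, LAYERS, OTHER = 200_000_000, 32, 300_000_000

full = remaining_params(PER_LAYER, LAYERS, OTHER, layers_removed=0)
pruned = remaining_params(PER_LAYER, LAYERS, OTHER, layers_removed=8)

for name, params in [("full", full), ("pruned", pruned)]:
    for bits in (16, 4):
        print(f"{name:6s} {bits:2d}-bit: {footprint_gib(params, bits):6.2f} GiB")
```

Because layer removal lowers the number of weights and quantization lowers the bits per weight, the two reductions multiply in this estimate rather than compete. Note that removing a quarter of the layers removes slightly less than a quarter of the parameters, since the parameters outside the layers stay.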
### Methodology
1. **Layer Importance**: Block Influence (BI) is introduced as a new indicator of how important each layer is. BI measures how much a layer changes the hidden state: the lower the BI score, the higher the cosine similarity between the layer's input and output hidden states, which means the layer transforms the hidden state very little and is therefore less important.
2. **Layer Removal**: Layers are ranked according to their BI scores, and layers with lower BI scores are removed. Experimental results show that this method can significantly reduce the model size while maintaining high performance.
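The two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes BI for a layer is one minus the average cosine similarity between that layer's input and output hidden states over a set of tokens, which matches the description above. The function names and the toy hidden states are hypothetical, and plain Python lists stand in for the tensors a real model would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block_influence(hidden_states):
    """BI score per layer (assumed form: 1 - mean cosine similarity).

    hidden_states[i] holds the token vectors entering layer i, and
    hidden_states[i + 1] holds the vectors leaving it, so a model with
    N layers supplies N + 1 entries.
    """
    scores = []
    for layer_in, layer_out in zip(hidden_states, hidden_states[1:]):
        sims = [cosine_similarity(x, y) for x, y in zip(layer_in, layer_out)]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


def layers_to_remove(scores, num_remove):
    """Indices of the `num_remove` layers with the lowest BI scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:num_remove])


def prune(layers, remove_indices):
    """Return the layer list with the selected layers deleted."""
    drop = set(remove_indices)
    return [layer for i, layer in enumerate(layers) if i not in drop]


# Toy example: 3 layers, 2 tokens, hidden size 2 (values are made up).
toy_states = [
    [[1, 0], [0, 1]],  # input to layer 0
    [[1, 0], [0, 1]],  # layer 0 leaves the hidden state unchanged
    [[0, 1], [1, 0]],  # layer 1 rotates every token vector by 90 degrees
    [[0, 2], [2, 0]],  # layer 2 only rescales, so the direction is kept
]
toy_scores = block_influence(toy_states)
toy_removed = layers_to_remove(toy_scores, num_remove=2)
toy_kept = prune(["layer0", "layer1", "layer2"], toy_removed)
print(toy_scores, toy_removed, toy_kept)
```

In the toy data only layer 1 changes the direction of the hidden state, so it receives the highest score and is the one layer kept. Because cosine similarity ignores vector length, a layer that only rescales its input scores as low as one that does nothing.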
### Experimental Results
- **Performance Comparison**: Compared with existing pruning methods, ShortGPT performs well on multiple natural language processing benchmarks, especially on MMLU and on perplexity.
- **The Influence of Different Pruning Ratios**: As the pruning ratio increases, model performance declines gradually. Removing certain layers, however, causes a sudden drop, which indicates that the network contains a small number of critical layers.
- **Redundancy in Non-Transformer Models**: The analysis is also extended to non-Transformer models (such as RWKV and Mamba). These models show a similar layer-redundancy phenomenon, suggesting that redundancy is a common feature of current large language models.
### Limitations
- **Performance on Generation Tasks**: Layer removal hurts generation tasks (such as XSum and C3) more than multiple-choice tasks, especially for smaller models. The authors speculate that generation tasks suffer from errors accumulating across generated tokens, and that larger models are more robust to this.
- **Post-training Techniques**: The performance lost through layer removal can be partially recovered with post-training techniques, but this requires further exploration.
### Related Work
- **Model Pruning**: This includes unstructured pruning and structured pruning. Unstructured pruning simplifies the model by removing individual parameters without regard to the model's internal structure. Structured pruning is more practical for deployment: it compresses the model by removing whole non-critical structures.
- **Quantization**: Converting floating-point weights to integers or other discrete forms significantly reduces storage and computational costs.
- **Model Redundancy**: Researchers have long noticed significant redundancy in nonlinear models, and in recent years, the redundancy of the Transformer model architecture has also been studied.
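As an illustration of the quantization idea mentioned above, the following is a minimal sketch of symmetric 8-bit quantization with a single scale per weight list. It is a generic textbook scheme chosen for brevity, not the specific quantization method evaluated in the paper, and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


sample_weights = [0.5, -1.27, 0.0, 1.27]  # made-up values
q_weights, q_scale = quantize_int8(sample_weights)
restored = dequantize(q_weights, q_scale)
max_error = max(abs(a - b) for a, b in zip(sample_weights, restored))
print(q_weights, q_scale, max_error)
```

Each weight is stored as one small integer plus a shared scale, so storage drops from 16 or 32 bits per weight to 8, at the cost of a rounding error of at most about half the scale per weight. This operates on the values of the weights, whereas layer removal operates on how many weights exist, which is why the two can be combined.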
In conclusion, this paper addresses layer redundancy in LLMs by introducing the BI indicator and a layer-removal method built on it, offering a simple new approach to model compression.