Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism

Guanchen Li, Xiandong Zhao, Lian Liu, Zeping Li, Dong Li, Lu Tian, Jie He, Ashish Sirasao, Emad Barsoum
2024-08-20
Abstract: Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ one-shot techniques to compress PLMs without retraining on either task-specific or general data; however, these approaches often lead to a significant reduction in performance. In this paper, we propose SDS, a Sparse-Dense-Sparse pruning framework that enhances the performance of pruned PLMs from a weight distribution optimization perspective. The pruning process has three steps. First, we prune less critical connections in the model using conventional one-shot pruning methods. Next, we reconstruct a dense model with a pruning-friendly weight distribution by reactivating the pruned connections under sparse regularization. Finally, we perform a second pruning round, yielding a pruned model superior to the one produced by the initial pruning. Experimental results demonstrate that SDS outperforms the state-of-the-art pruning techniques SparseGPT and Wanda under an identical sparsity configuration. For instance, for OPT-125M with 2:4 sparsity, SDS reduces perplexity by 9.13 on Raw-Wikitext2 and improves accuracy by an average of 2.05% across multiple zero-shot benchmarks.
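The 2:4 sparsity setting mentioned in the abstract keeps at most two nonzero weights in every group of four consecutive weights. The sketch below illustrates that pattern only. It ranks weights by plain magnitude, which is an assumption made for illustration and is not the importance criterion used by SparseGPT, Wanda, or SDS; the function name `prune_2_of_4` is hypothetical.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four.

    Illustrative magnitude criterion only (an assumption, not the paper's).
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(prune_2_of_4(row))
# [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```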
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address the performance degradation that pre-trained language models (PLMs) suffer when they are compressed. Specifically:

1. **Compression of Large-Scale PLMs**:
   - Although pre-trained language models perform excellently in various natural language processing tasks, their large size leads to significant computational and storage costs.
   - Existing pruning strategies, such as one-shot pruning methods, can reduce the number of parameters but usually result in significant performance degradation.
2. **Pruning Challenges for Compact PLMs**:
   - For smaller, well-trained PLMs, existing pruning methods are less effective. The parameter distribution of these models is more uniform, making them difficult to compress.
   - The lack of sparse regularization during the pruning process makes direct pruning less effective.
3. **Introduction of the Sparse-Dense-Sparse (SDS) Framework**:
   - To overcome the above issues, the authors propose a three-step pruning framework, Sparse-Dense-Sparse (SDS), which improves the performance of the pruned model by optimizing the weight distribution.
   - The SDS framework includes three steps: initial pruning, reconstruction of the dense model, and secondary pruning.

### Main Contributions

1. **Introduction of the SDS Framework**:
   - A new three-step pruning method is proposed, which enhances the performance of pre-trained language models after one-shot pruning through weight redistribution and a second round of pruning.
2. **Design of Sparse Regularization Strategies**:
   - Various sparse regularization strategies are introduced to optimize the weight distribution during the reconstruction of the dense model, making it more suitable for subsequent pruning.
3. **Experimental Validation**:
   - Experimental results show that the SDS framework outperforms existing pruning methods, such as SparseGPT and Wanda, under the same sparsity configuration.
   - For example, for OPT-125M with 2:4 sparsity, SDS reduced perplexity by 9.13 on the Raw-Wikitext2 dataset and improved accuracy by an average of 2.05% across multiple zero-shot benchmarks.

### Conclusion

The SDS framework improves the performance of pre-trained language models after pruning by optimizing the weight distribution, especially in compact models. Experimental results validate the effectiveness of the SDS framework, providing a new solution for the efficient compression of pre-trained language models.
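The three SDS steps described above (initial pruning, dense reconstruction with sparse regularization, second pruning) can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: it treats a single output neuron as a weight vector, uses magnitude-based 2:4 pruning in place of SparseGPT or Wanda, and stands in for the paper's sparse regularization with an L1 penalty optimized by proximal gradient descent on a reconstruction loss. All function names, the random calibration data, and the hyperparameters `lam`, `lr`, and `steps` are hypothetical.

```python
import math
import random


def prune_2_of_4(w):
    """Keep the two largest-magnitude weights in each group of four."""
    out = []
    for start in range(0, len(w), 4):
        group = w[start:start + 4]
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]),
                      reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recon_error(w, inputs, targets):
    """Mean squared error between w's outputs and the dense outputs."""
    return sum((dot(w, x) - t) ** 2 for x, t in zip(inputs, targets)) / len(inputs)


def redense_l1(w_init, inputs, targets, lam=0.01, lr=0.05, steps=200):
    """Re-dense step: every weight is trainable again (pruned ones too).

    Minimizes MSE + lam * ||w||_1 with proximal gradient descent;
    the L1 penalty is an assumed stand-in for sparse regularization.
    """
    w = list(w_init)
    n = len(inputs)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t in zip(inputs, targets):
            residual = dot(w, x) - t
            for j, xj in enumerate(x):
                grad[j] += 2.0 * residual * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
        # Soft-thresholding: the proximal operator of the L1 penalty.
        w = [math.copysign(max(abs(wj) - lr * lam, 0.0), wj) for wj in w]
    return w


random.seed(0)
dense = [random.gauss(0, 1) for _ in range(8)]            # toy "pre-trained" weights
inputs = [[random.gauss(0, 1) for _ in range(8)] for _ in range(64)]  # toy calibration data
targets = [dot(dense, x) for x in inputs]                 # dense outputs to match

sparse_1 = prune_2_of_4(dense)                     # step 1: initial one-shot pruning
redense = redense_l1(sparse_1, inputs, targets)    # step 2: regularized dense reconstruction
sparse_2 = prune_2_of_4(redense)                   # step 3: second pruning round

print("error after first pruning: ", recon_error(sparse_1, inputs, targets))
print("error after second pruning:", recon_error(sparse_2, inputs, targets))
```

The toy example only shows the order of operations; whether the second pruning round gives a lower error here depends on the data and hyperparameters, and it says nothing about the results reported for real PLMs.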