Abstract: Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ one-shot techniques to compress PLMs without retraining on task-specific or general data; however, these approaches often lead to a significant reduction in performance. In this paper, we propose SDS, a Sparse-Dense-Sparse pruning framework that enhances the performance of pruned PLMs from a weight distribution optimization perspective. The pruning process has three steps. First, we prune less critical connections in the model using conventional one-shot pruning methods. Next, we reconstruct a dense model with a pruning-friendly weight distribution by reactivating the pruned connections under sparse regularization. Finally, we perform a second pruning round, yielding a pruned model superior to the one from the initial pruning. Experimental results demonstrate that SDS outperforms the state-of-the-art pruning techniques SparseGPT and Wanda under an identical sparsity configuration. For instance, SDS reduces perplexity by 9.13 on Raw-Wikitext2 and improves accuracy by an average of 2.05% across multiple zero-shot benchmarks for OPT-125M with 2:4 sparsity.
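
The 2:4 sparsity configuration mentioned in the abstract keeps at most two nonzero weights in every contiguous group of four. The following is a minimal sketch of that pattern, assuming weight magnitude as the importance score; this is a simplification for illustration, since SparseGPT and Wanda use their own importance criteria, and the function name `prune_2_4` is hypothetical.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude values in each group of 4; zero the rest.

    Assumption: importance is plain |weight|, chosen only for illustration.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


weights = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.2, 1.5]
pruned = prune_2_4(weights)
print(pruned)  # [0.0, -0.9, 0.4, 0.0, 2.0, 0.0, 0.0, 1.5]
```

Every group of four ends up with exactly two surviving weights, which is 50% sparsity in a semi-structured layout.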

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the performance degradation that pre-trained language models (PLMs) suffer during compression. Specifically:

1. **Compression of Large-Scale PLMs**:
   - Although pre-trained language models perform excellently in various natural language processing tasks, their large size leads to significant computational and storage costs.
   - Existing pruning strategies, such as one-shot pruning methods, can reduce the number of parameters but usually result in significant performance degradation.
2. **Pruning Challenges for Compact PLMs**:
   - For smaller, well-trained PLMs, existing pruning methods are less effective. The parameter distribution of these models is more uniform, making them difficult to compress.
   - The lack of sparse regularization during the pruning process makes direct pruning less effective.
3. **Introduction of the Sparse-Dense-Sparse (SDS) Framework**:
   - To overcome the above issues, the authors propose a three-step pruning framework, Sparse-Dense-Sparse (SDS), which improves the performance of the pruned model by optimizing the weight distribution.
   - The SDS framework includes three steps: initial pruning, reconstruction of the dense model, and secondary pruning.

### Main Contributions

1. **Introduction of the SDS Framework**:
   - A new three-step pruning method is proposed, which enhances the performance of pre-trained language models after one-shot pruning through weight redistribution and pruning.
2. **Design of Sparse Regularization Strategies**:
   - Various sparse regularization strategies are introduced to optimize the weight distribution during the reconstruction of the dense model, making it more suitable for subsequent pruning.
3. **Experimental Validation**:
   - Experimental results show that the SDS framework outperforms existing pruning methods, such as SparseGPT and Wanda, under the same sparsity configuration.
   - For example, for OPT-125M with 2:4 sparsity, SDS reduced perplexity on the Raw-Wikitext2 dataset by 9.13 and improved accuracy by an average of 2.05% across multiple zero-shot benchmarks.

### Conclusion

The SDS framework effectively improves the performance of pre-trained language models after pruning by optimizing the weight distribution, especially in compact models. Experimental results validate the effectiveness of the SDS framework, providing a new solution for the efficient compression of pre-trained language models.
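
The three steps summarized above (initial pruning, dense reconstruction under sparse regularization, second pruning) can be sketched on a single toy linear layer. This is only an illustrative sketch under stated assumptions, not the authors' implementation: row-wise magnitude pruning stands in for SparseGPT or Wanda, the reconstruction target is assumed to be the original dense layer's output, the sparse regularizer is assumed to be an L1 penalty on the weights solved by proximal gradient descent, and all function names and hyperparameters are hypothetical.

```python
import numpy as np


def magnitude_prune(w, sparsity=0.5):
    """Zero the smallest-magnitude fraction of weights in each row (stand-in pruner)."""
    k = int(w.shape[1] * sparsity)
    idx = np.argsort(np.abs(w), axis=1)[:, :k]
    pruned = w.copy()
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned


def redense(w_sparse, w_dense, x, l1=1e-3, lr=0.1, steps=200):
    """Re-activate pruned weights while an L1 penalty keeps the result pruning-friendly.

    Assumed objective: 0.5/n * ||x w^T - x w_dense^T||^2 + l1 * ||w||_1,
    minimized by proximal gradient descent starting from the sparse weights.
    """
    target = x @ w_dense.T
    w = w_sparse.copy()
    n = x.shape[0]
    for _ in range(steps):
        grad = (x @ w.T - target).T @ x / n
        w = w - lr * grad  # every weight is trainable, including pruned ones
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)  # soft threshold
    return w


def output_error(w, w_dense, x):
    """Mean squared difference between a layer's output and the dense layer's output."""
    return float(np.mean((x @ w.T - x @ w_dense.T) ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))       # toy calibration inputs
w_dense = rng.normal(size=(8, 16))   # toy dense layer weights

w_first = magnitude_prune(w_dense)        # step 1: initial one-shot pruning
w_redense = redense(w_first, w_dense, x)  # step 2: dense reconstruction
w_second = magnitude_prune(w_redense)     # step 3: second pruning round

print("error after first pruning: ", output_error(w_first, w_dense, x))
print("error after second pruning:", output_error(w_second, w_dense, x))
```

On random toy data like this, the second pruning round is not guaranteed to beat the first; the sketch only shows how the three stages connect, with the regularized dense stage sitting between two pruning passes.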

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
