Abstract: The success of Large Language Models (LLMs) across complex tasks relies heavily on their scale, which makes deployment challenging because of their large memory consumption. Recently, numerous studies have attempted to compress LLMs using one-shot pruning methods. However, these methods often suffer considerable performance degradation on complex language understanding tasks, calling into question the feasibility of pruning LLMs. To address this issue, we propose Adaptive Sparse Trainer (AST), a retraining-based pruning pipeline for semi-structured sparse models. Unlike previous one-shot pruning methods, AST incrementally transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout training. Furthermore, we observe that distillation with a dense model as the teacher can prevent the sparse model from falling into local optima and accelerate convergence. In addition, we incorporate extra well-initialized parameters to further enhance model performance with minimal increase in memory footprint. AST significantly improves model performance, approaching that of dense models. When applied to LLaMA2-7B, AST reduces the accuracy gap between dense and semi-structured sparse models to 1.12% across multiple zero-shot tasks while using less than 0.4% of the pretraining tokens. Our work demonstrates the feasibility of deploying semi-structured sparse LLMs and, when combined with existing quantization techniques, introduces a novel method for achieving highly compressed models.
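To make the idea of decaying masked weights while reselecting masks more concrete, the following is a minimal NumPy sketch. It is an illustration only, not the authors' implementation: the abstract does not specify the mask-selection rule or the decay schedule, so the 2:4 pattern, the magnitude-based selection, the multiplicative decay, and the helper names `nm_mask` and `decay_step` are all assumptions made here.

```python
import numpy as np


def nm_mask(weights, n=2, m=4):
    """Boolean mask keeping the n largest-magnitude entries in every
    group of m consecutive weights along each row (assumed N:M rule)."""
    rows, cols = weights.shape
    assert cols % m == 0, "row length must be a multiple of m"
    groups = np.abs(weights).reshape(rows, cols // m, m)
    # Indices of the n largest magnitudes inside each group of m.
    keep = np.argsort(groups, axis=-1)[..., -n:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(rows, cols)


def decay_step(weights, decay=0.1, n=2, m=4):
    """Reselect the mask from current magnitudes, then shrink (rather
    than zero out) the masked weights. The multiplicative decay is an
    assumed schedule, chosen only for illustration."""
    mask = nm_mask(weights, n, m)
    new_weights = np.where(mask, weights, weights * (1.0 - decay))
    return new_weights, mask


# Toy usage: one row with two groups of four weights.
W = np.array([[0.9, -0.1, 0.4, 0.05, -0.3, 0.8, 0.2, -0.7]])
for _ in range(3):
    # In real training a gradient update would occur between decay
    # steps, so a masked weight could grow back and re-enter the mask.
    W, mask = decay_step(W, decay=0.5)
print(mask)
print(W)
```

Because masked weights are shrunk gradually instead of being set to zero at once, the mask can still change between steps when gradient updates alter the magnitudes, which is the intuition behind contrasting this approach with one-shot pruning.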

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

On the Impact of Calibration Data in Post-training Quantization and Pruning

Fluctuation-based Adaptive Structured Pruning for Large Language Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Pruning as a Domain-specific LLM Extractor

Improving Language Model Size Reduction Using Better Pruning Criteria

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models