Abstract: The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms state-of-the-art methods in terms of the pruning objective and perplexity reduction, particularly for highly sparse models. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.

What problem does this paper attempt to address?

The paper addresses the heavy computational and storage demands of large language models (LLMs) by proposing a new framework called ALPS (ADMM-based LLM Pruning in one Shot). The aim is to reduce the number of model parameters through one-shot pruning, thereby improving model efficiency. Specifically, ALPS addresses the following issues:

1. **Resource consumption of large-scale language models**: While LLMs perform very well on natural language processing tasks, their scale brings high computational costs and storage demands.
2. **Application of one-shot pruning techniques**: Unlike traditional iterative pruning methods that require retraining, one-shot pruning removes redundant weights without additional training, which reduces the computational burden.
3. **Insufficient application of optimization techniques in pruning**: Because of the massive scale of LLMs, existing pruning methods often rely on heuristics rather than optimization techniques, which may lead to suboptimal compression.

To solve these problems, ALPS builds on the operator splitting technique, specifically the Alternating Direction Method of Multipliers (ADMM), combined with a post-processing step that uses the Preconditioned Conjugate Gradient (PCG) method. Specifically:

- ALPS formulates pruning as an optimization problem with an `ℓ0` constraint and applies ADMM to it directly, simultaneously determining the support set of the weights and updating the weights themselves.
- Once the support set stabilizes, ALPS employs a modified PCG method to solve for the optimal weights on that support. This method exploits the sparse matrix structure and GPU parallelism, significantly improving computational speed.
- By introducing a new penalty parameter update scheme, ALPS can ensure convergence while finding a high-quality support set.
- Experimental results show that, at high sparsity, ALPS achieves significant improvements in the pruning objective and in perplexity compared to existing methods.

In summary, the goal of ALPS is to achieve one-shot pruning of large language models through efficient optimization methods, thereby reducing the computational cost and storage demands of the model while maintaining or enhancing model performance.
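
The two stages described above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a layerwise reconstruction objective, minimizing `0.5 * ||X @ W_dense - X @ W||_F^2` subject to `W` having at most `k` nonzeros, which the summary does not spell out. The names `project_topk`, `admm_prune`, and `refit_on_support` are hypothetical, as is the geometric penalty schedule `rho_growth`, which stands in for the paper's penalty update scheme. The refit step uses a direct least-squares solve per column in place of the modified PCG method.

```python
import numpy as np

def project_topk(M, k):
    """Keep the k largest-magnitude entries of M and zero out the rest."""
    flat = M.ravel()
    out = np.zeros(flat.size, dtype=M.dtype)
    if k > 0:
        k = min(k, flat.size)
        idx = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
        out[idx] = flat[idx]
    return out.reshape(M.shape)

def admm_prune(X, W_dense, k, rho=0.1, rho_growth=1.05, iters=200):
    """ADMM on the splitting W = D, where D carries the l0 constraint.

    Assumed objective: 0.5 * ||X @ W_dense - X @ W||_F^2 with ||D||_0 <= k.
    The geometric growth of rho is an illustrative choice only.
    """
    H = X.T @ X                       # Gram matrix of the layer inputs
    G = H @ W_dense
    eigvals, Q = np.linalg.eigh(H)    # reused to solve (H + rho I) W = rhs
    eigvals = np.clip(eigvals, 0.0, None)
    D = project_topk(W_dense, k)      # start from magnitude pruning
    V = np.zeros_like(W_dense)        # dual variable
    for _ in range(iters):
        # W-update: unconstrained quadratic, solved in the eigenbasis of H
        rhs = G - V + rho * D
        W = Q @ ((Q.T @ rhs) / (eigvals + rho)[:, None])
        # D-update: projection onto the set of matrices with <= k nonzeros
        D = project_topk(W + V / rho, k)
        # Dual update, then increase the penalty
        V = V + rho * (W - D)
        rho *= rho_growth
    return D

def refit_on_support(X, W_dense, D):
    """Re-solve the least-squares problem with the support of D held fixed.

    A direct per-column solve; the paper uses a modified PCG method instead.
    """
    H = X.T @ X
    G = H @ W_dense
    W = np.zeros_like(D)
    for j in range(D.shape[1]):
        s = np.flatnonzero(D[:, j])   # rows kept for output column j
        if s.size:
            W[s, j] = np.linalg.lstsq(H[np.ix_(s, s)], G[s, j], rcond=None)[0]
    return W

# Toy usage: prune a random 16x8 weight matrix to 30% density.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))
W_dense = rng.standard_normal((16, 8))
k = int(0.3 * W_dense.size)
D = admm_prune(X, W_dense, k)
W_sparse = refit_on_support(X, W_dense, D)
```

Because the refit is the least-squares optimum on the support chosen by ADMM, its reconstruction error cannot exceed that of `D`. For matrices at LLM scale, the per-column dense solves become the bottleneck, which is the cost the summary says the PCG-based step is designed to avoid.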

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Pruning Foundation Models for High Accuracy without Retraining

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism

LLM-Pruner: On the Structural Pruning of Large Language Models

ALPINE: An adaptive language-agnostic pruning method for language models for code

MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation

DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration