ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

Xiang Meng, Kayhan Behdin, Haoyue Wang, Rahul Mazumder
2024-08-04
Abstract: The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms state-of-the-art methods in terms of the pruning objective and perplexity reduction, particularly for highly sparse models. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the heavy computational and storage demands of large language models (LLMs) by proposing a new framework called ALPS (ADMM-based LLM Pruning in one Shot). The aim is to reduce the number of model parameters through one-shot pruning, thereby improving model efficiency.

Specifically, ALPS addresses the following issues:

1. **Resource consumption of large-scale language models**: LLMs perform very well on natural language processing tasks, but their scale brings high computational costs and storage demands.
2. **Application of one-shot pruning techniques**: Unlike traditional iterative pruning methods that require retraining, one-shot pruning removes redundant weights without additional training, which reduces the computational burden.
3. **Insufficient application of optimization techniques in pruning**: Because of the massive scale of LLMs, existing pruning methods often rely on heuristics rather than optimization techniques, which may lead to suboptimal compression.

To solve these problems, ALPS proposes a framework based on operator splitting, specifically the Alternating Direction Method of Multipliers (ADMM), combined with a post-processing step that uses the Preconditioned Conjugate Gradient (PCG) method. Specifically:

- ALPS formulates pruning as an optimization problem with `ℓ0` constraints and applies ADMM directly to it, determining the support set of the weights and updating the weights at the same time.
- Once the support set stabilizes, ALPS employs a modified PCG method to solve for the optimal weights on that support. This method exploits the sparse matrix structure and GPU parallelism, significantly improving computational speed.
- A new penalty parameter update scheme lets ALPS guarantee convergence while still finding a high-quality support set.
- Experimental results show that at high sparsity levels, ALPS achieves significant improvements in the pruning objective and in perplexity compared to existing methods.

In summary, the goal of ALPS is to prune large language models in one shot through efficient optimization, reducing the computational cost and storage demands of the model while limiting the loss in model performance.
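The two stages described above (ADMM to find a support, then a refit of the weights on that support) can be illustrated with a small NumPy sketch. This is not the authors' implementation, and several pieces are assumptions made for illustration: the layer-wise reconstruction objective `0.5 * ||X W - X W_dense||_F^2` with a budget of `k` nonzeros, the function and parameter names (`project_topk`, `admm_prune`, `refit_on_support`, `rho`, `rho_growth`), and the simple geometric growth of `rho`, which stands in for the paper's penalty parameter update scheme. The refit uses an ordinary per-column least-squares solve in place of the modified PCG method.

```python
import numpy as np

def project_topk(M, k):
    """Projection onto the l0 ball: keep the k largest-magnitude entries."""
    out = np.zeros_like(M)
    k = min(k, M.size)
    if k > 0:
        idx = np.argpartition(np.abs(M).ravel(), -k)[-k:]
        out.flat[idx] = M.flat[idx]
    return out

def admm_prune(X, W_dense, k, rho=0.1, rho_growth=1.05, iters=200):
    """ADMM on: min 0.5*||X W - X W_dense||_F^2  s.t.  W = D, ||D||_0 <= k.

    The objective and the geometric rho schedule are assumptions for this
    sketch, not details taken from the paper.
    """
    H = X.T @ X                      # Gram matrix of the layer inputs
    G = H @ W_dense                  # right-hand side of the W-update
    d = H.shape[0]
    D = project_topk(W_dense, k)     # start from magnitude pruning
    V = np.zeros_like(W_dense)       # dual variable
    for _ in range(iters):
        # W-update: (H + rho*I) W = G - V + rho*D
        W = np.linalg.solve(H + rho * np.eye(d), G - V + rho * D)
        # D-update: project W + V/rho onto the sparsity constraint
        D = project_topk(W + V / rho, k)
        # Dual update
        V = V + rho * (W - D)
        rho *= rho_growth            # growing penalty pushes W and D together
    return D

def refit_on_support(X, W_dense, mask):
    """Refit weights on a fixed support, column by column.

    Plain least squares is used here as a stand-in for the PCG step.
    """
    Y = X @ W_dense
    W = np.zeros_like(W_dense)
    for j in range(W_dense.shape[1]):
        s = mask[:, j]
        if s.any():
            W[s, j] = np.linalg.lstsq(X[:, s], Y[:, j], rcond=None)[0]
    return W
```

A typical call under these assumptions would be `D = admm_prune(X, W_dense, k)` followed by `W = refit_on_support(X, W_dense, D != 0)`. Because the refit minimizes the reconstruction error over all weights sharing the support of `D`, its error is never larger than that of `D` itself.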