Fast and Effective Weight Update for Pruned Large Language Models

Vladimír Boža

2024-07-22

Abstract: Pruning large language models (LLMs) is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the lost performance caused by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates, preserving the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations. In our paper, we propose a fast and effective weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). We further extend it with a simple gradual pruning mask selection and achieve state-of-the-art pruning performance across a wide range of LLMs. Code is available at https://github.com/fmfi-compbio/admm-pruning.
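The layer-wise weight update described in the abstract can be illustrated with a short NumPy sketch. It assumes a single linear layer with calibration inputs X, dense weights W0 and a fixed boolean mask, and solves min over W of 0.5 * ||X W - X W0||^2 subject to W being zero outside the mask, using the textbook scaled-form ADMM splitting: a dense least-squares step for W, a projection Z onto the mask, and a dual update U. The function name `admm_weight_update`, the choice of the penalty rho as a multiple of the mean diagonal of X^T X, and the iteration count are illustrative assumptions, not the authors' released code.

```python
import numpy as np


def admm_weight_update(X, W0, mask, rho_scale=1.0, iters=200):
    """Re-fit the kept weights of a pruned linear layer (illustrative sketch).

    Approximately solves  min_W 0.5 * ||X @ W - X @ W0||_F^2
    subject to W[~mask] == 0, using scaled-form ADMM.

    X    : (n_samples, d_in) calibration inputs to the layer
    W0   : (d_in, d_out) dense weights before pruning
    mask : (d_in, d_out) boolean array, True where a weight is kept
    """
    G = X.T @ X                              # Gram matrix X^T X, built once per layer
    XtY = G @ W0                             # X^T (X @ W0): target from the dense layer
    # Assumed heuristic: tie the penalty rho to the scale of G.
    rho = rho_scale * np.mean(np.diag(G))
    # (G + rho * I) never changes, so invert it once and reuse it every iteration.
    A_inv = np.linalg.inv(G + rho * np.eye(G.shape[0]))

    Z = W0 * mask                            # sparse iterate, starts at the naively pruned weights
    U = np.zeros_like(W0)                    # scaled dual variable
    for _ in range(iters):
        W = A_inv @ (XtY + rho * (Z - U))    # least-squares step (dense)
        Z = (W + U) * mask                   # project onto the allowed sparsity pattern
        U = U + W - Z                        # dual update on the constraint W = Z
    return Z
```

For example, with `mask = np.abs(W0) > np.median(np.abs(W0))` (keep the larger half of the weights by magnitude), `admm_weight_update(X, W0, mask)` returns weights with exactly the mask's zeros whose layer output approaches the least-squares optimum for that mask, which is never worse than simply zeroing the pruned entries. Inverting X^T X + rho I once per layer is what keeps each iteration cheap: the loop body is only matrix products and an elementwise mask.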

Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper addresses the problem of pruning large language models (LLMs) and updating their remaining weights efficiently. Specifically, it focuses on how to recover model performance after pruning, given the computational and memory challenges posed by very large models. Traditional pruning methods rely on extensive fine-tuning to recover the performance lost by removing weights, which is impractical for LLMs because of their huge computational and memory requirements; some studies report that recovering performance can require retraining on billions of tokens.

To address this, the paper proposes a fast and effective layer-wise weight update algorithm based on the Alternating Direction Method of Multipliers (ADMM). The algorithm completes pruning and weight update in a single forward pass, which significantly reduces computational overhead, and it achieves state-of-the-art pruning results on a variety of LLMs. The paper also introduces a gradual pruning mask selection, which further improves pruning performance by increasing the pruning ratio step by step.

In summary, the main contribution of the paper is an efficient, low-overhead pruning and weight update method for LLMs that effectively restores model performance after pruning.
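The gradual pruning idea described above can be sketched by letting the mask change during the first ADMM iterations instead of fixing it in advance. In this sketch, which is an assumption rather than the paper's exact procedure, the sparsity ramps linearly from 0 to the target over `ramp_iters` iterations, the mask is re-selected at each ramp step by ranking |W + U| scaled by the L2 norm of each input feature (a Wanda-style score, ranked over the whole layer for simplicity), and the remaining iterations run plain ADMM with the final mask. The function names `select_mask` and `gradual_admm_prune`, the linear schedule, and all defaults are hypothetical.

```python
import numpy as np


def select_mask(W, feature_norms, sparsity):
    """Boolean mask keeping the (1 - sparsity) fraction of highest-scoring weights.

    Score = |W[i, j]| * ||X[:, i]||_2 (Wanda-style), ranked over the whole layer.
    """
    score = np.abs(W) * feature_norms[:, None]
    n_prune = int(round(sparsity * W.size))
    mask = np.ones(W.shape, dtype=bool)
    lowest = np.argsort(score, axis=None)[:n_prune]      # flat indices of the lowest scores
    mask[np.unravel_index(lowest, W.shape)] = False
    return mask


def gradual_admm_prune(X, W0, target_sparsity, ramp_iters=20, total_iters=150, rho_scale=1.0):
    """Hypothetical gradual pruning: raise sparsity over the first ramp_iters ADMM steps."""
    G = X.T @ X
    XtY = G @ W0
    rho = rho_scale * np.mean(np.diag(G))                  # assumed penalty heuristic
    A_inv = np.linalg.inv(G + rho * np.eye(G.shape[0]))
    feature_norms = np.sqrt(np.diag(G))                    # ||X[:, i]||_2 per input feature

    Z = W0.copy()                                          # start dense (sparsity 0)
    U = np.zeros_like(W0)
    mask = np.ones(W0.shape, dtype=bool)
    for k in range(total_iters):
        W = A_inv @ (XtY + rho * (Z - U))                  # least-squares step
        if k < ramp_iters:
            # Linear ramp (assumption): reaches target_sparsity at k = ramp_iters - 1.
            sparsity = target_sparsity * (k + 1) / ramp_iters
            mask = select_mask(W + U, feature_norms, sparsity)
        Z = (W + U) * mask                                 # project onto the current mask
        U = U + W - Z                                      # dual update
    return Z, mask
```

Because the mask is chosen from the ADMM iterate rather than from the dense weights alone, weights that earlier update steps have adjusted are ranked on their updated values; once the ramp ends, the remaining iterations are ordinary fixed-mask ADMM. The schedule shape and ramp length here are placeholders, not values taken from the paper.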

A Simple and Effective Pruning Approach for Large Language Models

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

Reassessing Layer Pruning in LLMs: New Insights and Methods

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

Pruning Foundation Models for High Accuracy without Retraining

Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment

MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

A Systematic DNN Weight Pruning Framework Using Alternating Direction Method of Multipliers

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

LLM-Pruner: On the Structural Pruning of Large Language Models

Fluctuation-based Adaptive Structured Pruning for Large Language Models

Adaptive Pruning for Large Language Models with Structural Importance Awareness

ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models

Pruning Large Language Models via Accuracy Predictor

Pruning as a Domain-specific LLM Extractor

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models