Fast and Effective Weight Update for Pruned Large Language Models

Vladimír Boža
2024-07-22
Abstract: Pruning large language models (LLMs) is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the lost performance caused by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates, preserving the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations. In our paper, we propose a fast and effective weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). We further extend it with a simple gradual pruning mask selection and achieve state-of-the-art pruning performance across a wide range of LLMs. Code is available at https://github.com/fmfi-compbio/admm-pruning.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to prune large language models (LLMs) and update the remaining weights efficiently. Its focus is recovering model performance after pruning despite the computational and memory cost of working with models of this size. Traditional pruning methods rely on extensive fine-tuning to recover the performance lost by dropping weights, which is impractical for LLMs; some studies suggest that recovering performance may require retraining on billions of tokens. To avoid this cost, the paper proposes a fast and effective layer-wise weight update algorithm based on the Alternating Direction Method of Multipliers (ADMM). The algorithm performs pruning and weight update in a single forward pass through the model, which significantly reduces computational overhead, and it achieves state-of-the-art pruning results across a range of LLMs. The paper also introduces a gradual pruning mask selection scheme, which improves results further by increasing the pruning ratio step by step rather than all at once. In summary, the main contribution is an efficient, low-overhead pruning and weight update method for LLMs that recovers much of the performance lost to pruning.
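To make the layer-wise update concrete, here is a minimal sketch of how an ADMM iteration can solve a masked reconstruction problem for a single linear layer: find sparse weights whose outputs on calibration inputs stay close to the dense layer's outputs. It is an illustration under stated assumptions, not the code from the paper's repository. The function name `admm_masked_update`, the default choice of the penalty `rho`, the iteration count, the magnitude-based mask, and the random data are all hypothetical.

```python
import numpy as np


def admm_masked_update(X, W_dense, mask, rho=None, iters=50):
    """Approximately solve  min_W ||X W - X W_dense||_F^2
    subject to W being zero wherever `mask` is False.

    X:       (n_samples, d_in) calibration inputs to the layer
    W_dense: (d_in, d_out) original dense weights
    mask:    (d_in, d_out) boolean array, True = weight is kept
    """
    d_in = X.shape[1]
    H = X.T @ X  # Gram matrix of the calibration inputs
    if rho is None:
        # Assumed heuristic: put the penalty on the same scale as H.
        rho = np.trace(H) / d_in
    # Every W-step solves against the same matrix, so invert it once.
    H_inv = np.linalg.inv(H + rho * np.eye(d_in))
    HW = H @ W_dense

    Z = W_dense * mask           # sparse iterate, starts as plain masking
    U = np.zeros_like(W_dense)   # scaled dual variable
    for _ in range(iters):
        W = H_inv @ (HW + rho * (Z - U))  # unconstrained regularized least squares
        Z = (W + U) * mask                # projection onto the sparsity pattern
        U = U + W - Z                     # dual update for the constraint W = Z
    return Z


# Toy usage with random data and a hypothetical magnitude-based mask.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 16))
W_dense = rng.standard_normal((16, 8))
mask = np.abs(W_dense) >= np.median(np.abs(W_dense), axis=0)  # keep half per column

W_masked = W_dense * mask
W_admm = admm_masked_update(X, W_dense, mask)


def recon_error(W):
    """Frobenius norm of the change in layer outputs on X."""
    return np.linalg.norm(X @ W - X @ W_dense)


print("masking only:", recon_error(W_masked))
print("after ADMM:  ", recon_error(W_admm))
```

The matrix inverse is computed once and reused, so each iteration costs only a few matrix products and an elementwise mask. The sketch keeps the mask fixed; a gradual scheme in the spirit of the summary would tighten the mask over successive rounds, and that is not shown here.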