Abstract: As Large Language Models (LLMs) grow dramatically in size, there is growing interest in compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compression, especially in pruning medium-size networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede the utilization of gradients in guiding LLM pruning. As a result, most pruning strategies for LLMs rely on gradient-free criteria, such as weight magnitudes or a mix of magnitudes and activations. In this paper, we devise a hybrid pruning criterion, which appropriately integrates magnitude, activation, and gradient to capitalize on feature map sensitivity for pruning LLMs. To overcome memory requirement barriers, we estimate gradients using only forward passes. Based on this, we propose a Memory-effIcieNt structured prunIng procedure for LLMs (MINI-LLM) to remove non-critical channels and multi-attention heads. Experimental results demonstrate the superior performance of MINI-LLM over existing gradient-free methods on three LLMs: LLaMA, BLOOM, and OPT across various downstream tasks (classification, multiple-choice, and generation), while MINI-LLM maintains a GPU memory footprint akin to gradient-free methods.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
The paper "MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models" addresses the memory cost of using gradients to guide the compression of large language models (LLMs). Specifically, it proposes a structured pruning method, MINI-LLM, that estimates gradients using only forward passes, so that gradient information can inform pruning without the memory overhead of backpropagation.
#### Background and Motivation
1. **Scale and Computational Cost of LLMs**:
- Large language models like GPT-4 and LLaMA perform exceptionally well on various complex natural language processing tasks, but their enormous model size leads to significant storage, memory, and computation time overheads, posing substantial challenges during training and deployment.
2. **Existing Pruning Methods**:
- Traditional neural network pruning methods typically rely on gradients to assess the importance of weights. However, in LLMs, due to the vast number of parameters, backpropagation to compute gradients requires a large amount of memory resources. Therefore, most pruning strategies for LLMs rely on gradient-free methods, such as weight magnitude or a combination of weight magnitude and activation values.
3. **Core Problem**:
- How to utilize gradient information to guide the pruning of LLMs without significantly increasing memory consumption, thereby improving pruning effectiveness and model performance.
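The gradient-free criteria mentioned in item 2 can be illustrated with a minimal sketch. It assumes a single linear layer whose weight matrix has one row per output channel, and it scores whole rows so that pruning stays structured. The function names and the exact way activations are folded in (an L2 norm per input feature over a calibration batch) are illustrative assumptions, not details taken from the paper.

```python
import math


def magnitude_scores(weight):
    """Score each output channel (row) by the L2 norm of its weights."""
    return [math.sqrt(sum(w * w for w in row)) for row in weight]


def magnitude_activation_scores(weight, activations):
    """Score each row by sum_j |W[i][j]| * ||X[:, j]||_2.

    `activations` is a list of calibration inputs, one list of input
    features per sample. This weighting is an assumed, illustrative choice.
    """
    n_in = len(weight[0])
    act_norm = [
        math.sqrt(sum(sample[j] ** 2 for sample in activations))
        for j in range(n_in)
    ]
    return [sum(abs(w) * a for w, a in zip(row, act_norm)) for row in weight]


def channels_to_prune(scores, ratio):
    """Return the indices of the lowest-scoring channels."""
    k = int(len(scores) * ratio)
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


# Toy layer: 3 output channels, 2 input features.
weight = [[1.0, 0.0], [0.0, 0.5], [3.0, 4.0]]
# Calibration batch in which the first input feature is always zero.
activations = [[0.0, 2.0], [0.0, 2.0]]

print(channels_to_prune(magnitude_scores(weight), 0.5))  # [1]
print(channels_to_prune(magnitude_activation_scores(weight, activations), 0.5))  # [0]
```

In this toy case the two criteria disagree: magnitude alone removes the channel with the smallest weights, while the activation-aware score removes the channel that only reads an input feature that is never active.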
#### Solution
1. **New Pruning Criterion**:
- The paper designs a new pruning criterion called the Feature Map Sensitivity (FMS) score, which combines weight magnitude, activation values, and gradients. Using all three signals together lets the score reflect how sensitive each feature map is to pruning, rather than relying on weight size or activation scale alone.
2. **Structured Pruning Framework**:
- A structured pruning framework named MINI-LLM is proposed. It estimates gradients using only forward propagation, which avoids the memory cost of backpropagation and keeps GPU memory usage comparable to that of gradient-free methods.
3. **Experimental Validation**:
- Experimental results show that MINI-LLM outperforms existing gradient-free methods in various downstream tasks (classification, multiple-choice, and generation) across three different types of LLMs (LLaMA, BLOOM, and OPT). In some cases, it even surpasses methods based on backpropagation gradients, while its GPU memory usage remains comparable to gradient-free methods.
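To make the idea of forward-only gradient estimation concrete, here is a minimal sketch. This summary does not say which estimator MINI-LLM uses, so the code shows one well-known option, a two-point random-perturbation (SPSA-style) estimate, purely as an assumption. The `hybrid_scores` function is likewise a hypothetical way to combine magnitude, activation, and gradient. It is not the paper's FMS formula.

```python
import random


def spsa_gradient(loss_fn, params, eps=1e-3, n_samples=1, seed=0):
    """Estimate the gradient of loss_fn at params with forward passes only.

    Each sample perturbs all parameters along a random Gaussian direction z
    and evaluates the loss twice. No backpropagation is needed, so no
    intermediate activations have to be stored for a backward pass.
    """
    rng = random.Random(seed)
    grad = [0.0] * len(params)
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in params]
        loss_plus = loss_fn([p + eps * zi for p, zi in zip(params, z)])
        loss_minus = loss_fn([p - eps * zi for p, zi in zip(params, z)])
        # Finite-difference estimate of the directional derivative along z.
        scale = (loss_plus - loss_minus) / (2.0 * eps)
        for i, zi in enumerate(z):
            grad[i] += scale * zi / n_samples
    return grad


def hybrid_scores(weight, grad, act_norm):
    """Hypothetical channel score: sum_j |W[i][j] * G[i][j]| * act_norm[j]."""
    return [
        sum(abs(w * g) * a for w, g, a in zip(w_row, g_row, act_norm))
        for w_row, g_row in zip(weight, grad)
    ]


# Toy usage: a 2x2 weight matrix flattened to a vector, with a quadratic loss.
flat_weight = [1.0, 2.0, 3.0, 4.0]
toy_loss = lambda w: sum(v * v for v in w)
flat_grad = spsa_gradient(toy_loss, flat_weight, n_samples=64)

weight = [flat_weight[0:2], flat_weight[2:4]]
grad = [flat_grad[0:2], flat_grad[2:4]]
print(hybrid_scores(weight, grad, act_norm=[1.0, 1.0]))
```

The estimate is noisy for small `n_samples`. Averaging over more random directions reduces the variance at the cost of more forward passes, which is the trade-off any forward-only scheme of this kind has to manage.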
### Summary
By combining a new pruning criterion with a forward-pass-only gradient estimate in a structured pruning framework, the paper makes gradient-informed pruning of LLMs feasible at a GPU memory cost comparable to that of gradient-free methods.