Abstract: Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, including language modeling, understanding, and generation. However, the increased memory and computational costs associated with these models pose significant challenges for deployment on resource-limited devices. Structural pruning has emerged as a promising solution to reduce the costs of LLMs without requiring post-processing steps. Prior structural pruning methods either follow the dependence of structures at the cost of limiting flexibility, or introduce non-trivial additional parameters by incorporating different projection matrices. In this work, we propose a novel approach that relaxes the constraint imposed by regular structural pruning methods and eliminates the structural dependence along the embedding dimension. Our dimension-independent structural pruning method offers several benefits. Firstly, our method enables different blocks to utilize different subsets of the feature maps. Secondly, by removing structural dependence, we facilitate each block to possess varying widths along its input and output dimensions, thereby significantly enhancing the flexibility of structural pruning. We evaluate our method on various LLMs, including OPT, LLaMA, LLaMA-2, Phi-1.5, and Phi-2. Experimental results demonstrate that our approach outperforms other state-of-the-art methods, showing for the first time that structural pruning can achieve an accuracy similar to semi-structural pruning.

What problem does this paper attempt to address?

This paper addresses the high memory and computational cost of deploying large language models (LLMs) on resource-constrained devices. It proposes a new structural pruning method, Dimension-Independent Structural Pruning (DISP-LLM), which aims to reduce the computational and memory requirements of LLMs while maintaining model performance.

### Main Problems

1. **High Computational and Memory Costs**: Because LLMs have a large number of parameters, they are difficult to deploy on resource-constrained devices (such as mobile phones).
2. **Limitations of Existing Structural Pruning Methods**:
   - **Structure-Dependent**: Existing structural pruning methods usually have to follow the dependence between structures, which limits the flexibility of pruning.
   - **Introducing Extra Parameters**: Some methods gain flexibility by incorporating different projection matrices, but this adds parameters and increases the complexity of the model.

### Solutions

The main contributions of DISP-LLM are as follows:

1. **Breaking Structure Dependence**: Different layers select different subsets of the embedding dimension, so each layer can use a different subset of the feature maps. This significantly improves the flexibility of pruning.
2. **No Need to Introduce Extra Parameters**: Unlike methods such as SliceGPT, DISP-LLM does not introduce extra parameters while breaking structure dependence.
3. **Learning the Width of Each Layer**: The width of each layer is learned by gradient-based optimization, which further improves the flexibility of pruning.
4. **Efficient Optimization**: Dimension-independent structural pruning is formulated as an optimization problem, and the number of remaining parameters is controlled by a regularization term.
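The first two contributions can be illustrated with a toy sketch. It is not the paper's implementation: the NumPy code below assumes a single residual MLP block with a ReLU activation, and the names `pruned_block`, `in_idx`, and `out_idx` are hypothetical. It shows how a block could read one subset of the embedding dimensions and write to a different subset purely by indexing, so the input and output widths differ and no projection matrix is added.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 8, 16  # toy embedding width and MLP hidden width (assumed sizes)

x = rng.standard_normal((4, d))           # residual stream for 4 tokens
W_in = rng.standard_normal((d, hidden))   # dense block weights before pruning
W_out = rng.standard_normal((hidden, d))


def pruned_block(x, W_in, W_out, in_idx, out_idx):
    """Residual MLP block that reads dims `in_idx` and writes dims `out_idx`.

    Hypothetical helper: the selection is plain indexing, so the block
    stores only the kept rows of W_in and the kept columns of W_out.
    """
    h = np.maximum(x[:, in_idx] @ W_in[in_idx, :], 0.0)  # select inputs, ReLU
    update = h @ W_out[:, out_idx]                       # narrower output
    y = x.copy()
    y[:, out_idx] += update  # add back into the full-width residual stream
    return y


# Input and output subsets are chosen independently (fixed here for
# illustration; in the paper the widths are learned).
in_idx = np.array([0, 2, 3, 5])
out_idx = np.array([1, 2, 6])

y = pruned_block(x, W_in, W_out, in_idx, out_idx)

# Parameter count of the block before and after pruning.
dense_params = W_in.size + W_out.size
kept_params = len(in_idx) * hidden + hidden * len(out_idx)
print(kept_params, dense_params)
```

In this sketch the pruned block gives the same output as the dense block with the unselected input and output dimensions masked to zero. A different block could pass its own `in_idx` and `out_idx`, which is the sense in which the pruning is independent along the embedding dimension.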
### Experimental Results

DISP-LLM outperforms other state-of-the-art structural pruning methods on multiple LLMs (including OPT, LLaMA, LLaMA-2, Phi-1.5 and Phi-2), and also performs better than other methods on zero-shot tasks.

### Specific Experimental Data

- **Language Modeling Tasks**: On the WikiText-2 dataset, DISP-LLM achieves better perplexity than other methods at different pruning ratios. At a 50% pruning ratio in particular, it outperforms LLM Surgeon by 5.54 and 2.22 in perplexity on the LLaMA-2 7B and 13B models respectively.
- **Zero-Shot Tasks**: On zero-shot tasks such as PIQA, WinoGrande, HellaSwag, ARC-e and ARC-c, DISP-LLM also performs better than other methods. At a 50% pruning ratio in particular, the average accuracy of the LLaMA-2 7B and 13B models is 58.10 and 51.05 respectively.

### Conclusion

By breaking structure dependence and learning the width of each layer, DISP-LLM significantly improves the flexibility and performance of structural pruning, providing an effective method for deploying LLMs on resource-constrained devices.

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

LLM-Pruner: On the Structural Pruning of Large Language Models

LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation

Adaptive Pruning for Large Language Models with Structural Importance Awareness

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

Fluctuation-based Adaptive Structured Pruning for Large Language Models

Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

SparseLLM: Towards Global Pruning for Pre-trained Language Models

SlimGPT: Layer-wise Structured Pruning for Large Language Models

Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment

BlockPruner: Fine-grained Pruning for Large Language Models

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Structured Optimal Brain Pruning for Large Language Models

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

Pruning Foundation Models for High Accuracy without Retraining

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models

Large Language Model Pruning

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models