Abstract: The colossal parameters and computational overhead of Large Language Models (LLMs) challenge their real-world applications. Network pruning, which targets unstructured or structured sparsity by removing redundant parameters, has recently been explored for LLM acceleration. Existing LLM pruning works focus on unstructured pruning, which typically requires special hardware support for a practical speed-up. In contrast, structured pruning can reduce latency on general devices. However, it remains a challenge to perform structured pruning efficiently and maintain performance, especially at high sparsity ratios. To this end, we introduce an efficient structured pruning framework named CFSP, which leverages both Coarse (interblock) and Fine-grained (intrablock) activation information as an importance criterion to guide pruning. The pruning is highly efficient, as it only requires one forward pass to compute feature activations. Specifically, we first allocate the sparsity budget across blocks based on their importance and then retain important weights within each block. In addition, we introduce a recovery fine-tuning strategy that adaptively allocates training overhead based on coarse-grained importance to further improve performance. Experimental results demonstrate that CFSP outperforms existing methods on diverse models across various sparsity budgets. Our code will be available at https://github.com/wyxscir/CFSP.

What problem does this paper attempt to address?

The paper addresses the difficulty of deploying large language models (LLMs) in practice, which stems from their large parameter counts and computational overhead. To overcome the limitations of existing LLM pruning methods, it proposes CFSP (Coarse-to-Fine Structured Pruning), an efficient structured pruning framework that uses coarse-grained and fine-grained activation information to guide pruning, reducing the parameter count and computational cost while preserving performance. The method improves pruning efficiency, maintains good model performance at high sparsity, and is especially suited to acceleration on general-purpose devices.

### Main contributions:

1. **Efficient coarse-to-fine importance criterion**: The paper proposes a coarse-to-fine importance criterion for identifying redundant structures to prune; the whole process can be completed in just a few minutes.
2. **Importance-based recovery fine-tuning strategy**: It introduces a recovery fine-tuning method that adaptively allocates additional trainable parameters according to the coarse-grained importance scores, enabling the pruned model to achieve similar performance with less recovery data.
3. **Extensive experimental verification**: Experimental results show that CFSP outperforms existing methods on different models and at different sparsity levels, and is particularly strong at high sparsity, demonstrating its potential on complex tasks.

### Core technologies of the solution:

- **Coarse-grained importance**: Measure the transformation saliency of each block by calculating the angular distance between its input and output feature activations, and use it as the basis for allocating the sparsity budget.
- **Fine-grained importance**: Inside each block, use the product of the relative activation value and the weight as a fine-grained criterion to remove redundant parts.
- **Dimension adjustment**: To ensure parallelism on the GPU, adjust the final dimension of each pruned block to be a multiple of 128.
- **Importance-guided recovery fine-tuning**: After pruning, further improve model performance by adaptively allocating additional trainable parameters.

Through these techniques, CFSP improves pruning efficiency and significantly reduces the number of model parameters and the computational cost while maintaining high performance, making it especially suitable for deploying large language models in resource-limited environments.
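
The first three techniques above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the linear budget-allocation rule with its `spread` parameter, and the use of per-channel norms are all assumptions made for illustration. Angular distance is taken here as the arccosine of the cosine similarity divided by π, which is one common definition and may differ from the paper's exact formula.

```python
import math


def angular_distance(x, y):
    """Angular distance in [0, 1] between two non-zero activation vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi


def allocate_sparsity(importance, target, spread=0.1):
    """Hypothetical rule: more important blocks get a lower sparsity ratio.

    The per-block offsets sum to zero, so the mean sparsity stays at `target`.
    """
    assert spread <= min(target, 1.0 - target), "ratios must stay in [0, 1]"
    mean = sum(importance) / len(importance)
    max_dev = max(abs(v - mean) for v in importance) or 1.0
    return [target - spread * (v - mean) / max_dev for v in importance]


def channel_scores(activation_norms, weight_norms):
    """Fine-grained score: relative activation times weight magnitude."""
    total = sum(activation_norms)
    return [a / total * w for a, w in zip(activation_norms, weight_norms)]


def kept_dim(full_dim, sparsity, multiple=128):
    """Number of channels to keep, rounded to a multiple of `multiple`."""
    raw = full_dim * (1.0 - sparsity)
    return max(multiple, round(raw / multiple) * multiple)


def select_channels(scores, k):
    """Indices of the k highest-scoring channels, in ascending index order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

In this sketch, a block whose output activations point in nearly the same direction as its input gets an angular distance near 0 and is treated as less important, so `allocate_sparsity` assigns it a higher pruning ratio. `kept_dim(11008, 0.5)` gives 5504, which is already a multiple of 128; other ratios are rounded to the nearest multiple.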

CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information
