CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information

Yuxin Wang, Minghua Ma, Zekun Wang, Jingchang Chen, Huiming Fan, Liping Shan, Qing Yang, Dongliang Xu, Ming Liu, Bing Qin
2024-09-20
Abstract: The colossal parameters and computational overhead of Large Language Models (LLMs) challenge their real-world applications. Network pruning, which targets unstructured or structured sparsity by removing redundant parameters, has recently been explored for LLM acceleration. Existing LLM pruning works focus on unstructured pruning, which typically requires special hardware support for a practical speed-up. In contrast, structured pruning can reduce latency on general devices. However, it remains a challenge to perform structured pruning efficiently and maintain performance, especially at high sparsity ratios. To this end, we introduce an efficient structured pruning framework named CFSP, which leverages both Coarse (interblock) and Fine-grained (intrablock) activation information as an importance criterion to guide pruning. The pruning is highly efficient, as it only requires one forward pass to compute feature activations. Specifically, we first allocate the sparsity budget across blocks based on their importance and then retain important weights within each block. In addition, we introduce a recovery fine-tuning strategy that adaptively allocates training overhead based on coarse-grained importance to further improve performance. Experimental results demonstrate that CFSP outperforms existing methods on diverse models across various sparsity budgets. Our code will be available at https://github.com/wyxscir/CFSP.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the difficulty of deploying large language models (LLMs) in practice, which stems from their large parameter counts and computational overhead. Existing LLM pruning methods focus mostly on unstructured pruning, which typically needs special hardware support to yield a practical speed-up. In response, the paper proposes CFSP (Coarse-to-Fine Structured Pruning), an efficient structured pruning framework that uses coarse-grained and fine-grained activation information to guide pruning, reducing parameter count and computational cost while maintaining performance. The method improves pruning efficiency, preserves model performance at high sparsity, and is especially suited to acceleration on general-purpose devices.

### Main contributions:

1. **Efficient coarse-to-fine importance criterion**: The paper proposes a coarse-to-fine importance criterion for identifying redundant structures to prune; the whole pruning process completes in just a few minutes.
2. **Importance-based recovery fine-tuning strategy**: The paper introduces a recovery fine-tuning method that adaptively allocates additional trainable parameters according to the coarse-grained importance scores, so that the pruned model reaches similar performance with less recovery data.
3. **Extensive experimental verification**: Experiments show that CFSP outperforms existing methods across different models and sparsity levels, with the largest advantage at high sparsity, which demonstrates its potential on complex tasks.

### Core technologies of the solution:

- **Coarse-grained importance**: The transformation saliency of each block is measured as the angular distance between the block's input and output feature activations, and this score is the basis for allocating the sparsity budget across blocks.
- **Fine-grained importance**: Within each block, the product of the relative activation value and the weight serves as the fine-grained criterion for removing redundant parts.
- **Dimension adjustment**: To preserve parallelism on the GPU, the final dimension of each pruned block is adjusted to a multiple of 128.
- **Importance-guided recovery fine-tuning**: After pruning, model performance is further improved by adaptively allocating additional trainable parameters.

With these techniques, CFSP improves pruning efficiency and significantly reduces parameter count and computational cost while maintaining high performance, which makes it well suited to deploying large language models in resource-limited environments.
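
The coarse-grained score and the dimension adjustment described above can be illustrated with a minimal Python sketch. It is not taken from the CFSP code base: the function names `angular_distance` and `round_to_multiple` are hypothetical, the division by π (which puts the distance in [0, 1]) is an assumed normalization, and rounding to the nearest multiple with a floor of one multiple is an assumption, since the summary only says the dimension becomes a multiple of 128.

```python
import math


def angular_distance(block_input, block_output):
    """Angular distance between two activation vectors, scaled to [0, 1].

    Assumption: the angle is divided by pi; the summary does not state
    which normalization the paper uses.
    """
    dot = sum(a * b for a, b in zip(block_input, block_output))
    norm_in = math.sqrt(sum(a * a for a in block_input))
    norm_out = math.sqrt(sum(b * b for b in block_output))
    if norm_in == 0.0 or norm_out == 0.0:
        raise ValueError("angular distance is undefined for a zero vector")
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_sim = max(-1.0, min(1.0, dot / (norm_in * norm_out)))
    return math.acos(cos_sim) / math.pi


def round_to_multiple(dim, multiple=128):
    """Round an integer dimension to the nearest multiple (assumed direction),
    keeping at least one multiple so the block is never pruned to zero width.
    """
    rounded = (dim + multiple // 2) // multiple * multiple
    return max(multiple, rounded)


if __name__ == "__main__":
    # A block whose output points in almost the same direction as its input
    # transforms the features little, so its distance is close to 0.
    print(angular_distance([1.0, 2.0, 3.0], [1.1, 2.0, 2.9]))
    # A target width of 1000 channels is adjusted to a multiple of 128.
    print(round_to_multiple(1000))
```

Under this reading, a block with a small angular distance changes its input little and is a candidate for a larger share of the sparsity budget; how the scores map to per-block budgets is not specified in this summary, so the sketch leaves that step out.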