Abstract: The considerable size of Large Language Models (LLMs) presents notable deployment challenges, particularly on resource-constrained hardware. Structured pruning offers an effective means to compress LLMs, thereby reducing storage costs and enhancing inference speed for more efficient utilization. In this work, we study data-efficient and resource-efficient structured pruning methods to obtain smaller yet still powerful models. Knowledge Distillation is well-suited for pruning, as the intact model can serve as an excellent teacher for pruned students. However, it becomes challenging in the context of LLMs due to memory constraints. To address this, we propose an efficient progressive Numerous-teacher pruning method (NutePrune). NutePrune mitigates excessive memory costs by loading only one intact model and integrating it with various masks and LoRA modules, enabling it to seamlessly switch between teacher and student roles. This approach allows us to leverage numerous teachers with varying capacities to progressively guide the pruned model, enhancing overall performance. Extensive experiments across various tasks demonstrate the effectiveness of NutePrune. In LLaMA-7B zero-shot experiments, NutePrune retains 97.17% of the performance of the original model at 20% sparsity and 95.07% at 25% sparsity. Our code is available at https://github.com/Lucius-lsr/NutePrune.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the difficulty of deploying and running inference with large language models (LLMs) on resource-constrained hardware. Specifically:

1. **The scale of large language models**: Because of their large number of parameters, large language models are difficult to deploy in practice, especially on hardware with limited resources.
2. **The need for structured pruning**: Structured pruning is an effective way to compress these models, reducing storage costs and increasing inference speed. The open question is how to prune efficiently without significantly degrading performance.
3. **The challenges of applying knowledge distillation**: Knowledge Distillation (KD) is an effective method for training small models guided by large models. With large language models, however, memory limitations make it impractical to load multiple teacher models. In addition, a single teacher model may not transfer knowledge fully, especially when there is a large capacity gap between the teacher and the student.

To solve these problems, the paper proposes an efficient progressive multi-teacher pruning method named **NutePrune**. NutePrune addresses the above challenges in the following ways:

- **Progressive multi-teacher pruning**: Multiple teacher models with different sparsities guide the pruned student step by step, narrowing the capacity gap between teacher and student.
- **Efficient memory utilization**: NutePrune loads only one intact model and switches between teacher and student roles through different masks and LoRA modules, avoiding the memory overhead of loading multiple teacher models.
- **Performance retention**: On LLaMA-7B, the experimental results show that NutePrune retains 97.17% of the original model's performance at 20% sparsity and 95.07% at 25% sparsity.

Together, these design choices let NutePrune compress large language models while retaining most of their performance in resource-constrained environments.
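
The role-switching idea described above can be illustrated with a small sketch. This is not the authors' implementation: the `SharedLinear` class, the role names, the toy weights, and the use of a mean-squared-error distillation loss are all assumptions made for illustration. It only shows how one frozen weight matrix can act as several teachers and one student by swapping a structured mask and an optional low-rank (LoRA-style) update.

```python
from dataclasses import dataclass
from typing import List, Optional

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class Role:
    mask: Vector                     # one 0/1 entry per output neuron
    lora_a: Optional[Matrix] = None  # shape (r, in_features)
    lora_b: Optional[Matrix] = None  # shape (out_features, r)


class SharedLinear:
    """Hypothetical layer: one frozen weight shared by all roles."""

    def __init__(self, weight: Matrix):
        self.weight = weight  # stored once, never copied per role
        self.roles = {}

    def add_role(self, name, mask, lora_a=None, lora_b=None):
        self.roles[name] = Role(mask, lora_a, lora_b)

    def forward(self, x: Vector, role: str) -> Vector:
        r = self.roles[role]
        y = matvec(self.weight, x)
        if r.lora_a is not None and r.lora_b is not None:
            # Low-rank update: y += B (A x)
            delta = matvec(r.lora_b, matvec(r.lora_a, x))
            y = [a + b for a, b in zip(y, delta)]
        # Structured mask: a 0 removes the whole output neuron.
        return [m * v for m, v in zip(r.mask, y)]


def sparsity(mask: Vector) -> float:
    """Fraction of output neurons removed by the mask."""
    return mask.count(0) / len(mask)


def mse(a: Vector, b: Vector) -> float:
    """Assumed distillation loss; the source does not specify one."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Toy 4x2 weight; all values are made up for the example.
layer = SharedLinear([[1, 0], [0, 1], [1, 1], [1, -1]])
layer.add_role("dense_teacher", [1, 1, 1, 1])
layer.add_role("teacher_25", [1, 1, 1, 0])
layer.add_role("student_50", [1, 0, 1, 0],
               lora_a=[[1, 0]], lora_b=[[0.5], [0], [0], [0]])

x = [2, 3]
student_out = layer.forward(x, "student_50")

# Compare the student against each teacher without loading a second model.
losses = {
    name: mse(layer.forward(x, name), student_out)
    for name in ("dense_teacher", "teacher_25")
}
for name, loss in losses.items():
    print(name, sparsity(layer.roles[name].mask), loss)
```

Each role stores only a mask and, optionally, two small LoRA matrices, so adding another teacher costs far less memory than loading another copy of the weights. How the teachers are scheduled during pruning is not described in this summary, so the sketch simply evaluates every teacher against the student.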

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
