NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

Shengrui Li, Junzhe Chen, Xueting Han, Jing Bai
2024-06-27
Abstract: The considerable size of Large Language Models (LLMs) presents notable deployment challenges, particularly on resource-constrained hardware. Structured pruning offers an effective means to compress LLMs, thereby reducing storage costs and enhancing inference speed for more efficient utilization. In this work, we study data-efficient and resource-efficient structured pruning methods to obtain smaller yet still powerful models. Knowledge Distillation is well-suited for pruning, as the intact model can serve as an excellent teacher for pruned students. However, it becomes challenging in the context of LLMs due to memory constraints. To address this, we propose an efficient progressive Numerous-teacher pruning method (NutePrune). NutePrune mitigates excessive memory costs by loading only one intact model and integrating it with various masks and LoRA modules, enabling it to seamlessly switch between teacher and student roles. This approach allows us to leverage numerous teachers with varying capacities to progressively guide the pruned model, enhancing overall performance. Extensive experiments across various tasks demonstrate the effectiveness of NutePrune. In LLaMA-7B zero-shot experiments, NutePrune retains 97.17% of the performance of the original model at 20% sparsity and 95.07% at 25% sparsity. Our code is available at https://github.com/Lucius-lsr/NutePrune.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the challenge of deploying and running inference with large language models (LLMs) on resource-constrained hardware. Specifically:

1. **The scale of large language models**: Because of their large parameter counts, LLMs are difficult to deploy in practice, especially on hardware with limited resources.
2. **The need for structured pruning**: Structured pruning is an effective way to compress LLMs, reducing storage costs and increasing inference speed. The open problem is how to prune efficiently without significantly degrading performance.
3. **The challenges of applying knowledge distillation**: Knowledge Distillation (KD) trains a small model under the guidance of a large one. With LLMs, however, memory limits make it impractical to load multiple teacher models. In addition, a single teacher may not transfer knowledge fully, especially when there is a large capacity gap between teacher and student.

To solve these problems, the paper proposes an efficient progressive multi-teacher pruning method named **NutePrune**. NutePrune addresses the above challenges in the following ways:

- **Progressive multi-teacher pruning**: Multiple teachers with different sparsities guide the pruned student step by step, narrowing the capacity gap between teacher and student.
- **Efficient memory utilization**: NutePrune loads only one intact model and switches between teacher and student roles through different masks and LoRA modules, avoiding the memory overhead of loading multiple teacher models.
- **Performance retention**: In LLaMA-7B zero-shot experiments, NutePrune retains 97.17% of the original model's performance at 20% sparsity and 95.07% at 25% sparsity.

Together, these innovations let NutePrune compress LLMs while retaining most of their performance in resource-constrained settings.
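
The role-switching idea above can be illustrated with a toy example. The following is a minimal sketch in plain Python, not the authors' implementation: `Role`, `SharedLayer`, and `distill_loss` are hypothetical names, the mask is applied per output unit of a single linear layer as a simplification of structured pruning, and mean squared error stands in for whatever distillation objective the paper actually uses. It only shows how one frozen weight matrix can act as an intact teacher, a sparser teacher, or a student, depending on which mask and LoRA pair is attached.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product, so the sketch needs no third-party library."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


@dataclass(frozen=True)
class Role:
    """A teacher or student 'view' of the shared weights (hypothetical helper)."""
    name: str
    mask: Optional[Vector] = None                 # 1.0 keeps an output unit, 0.0 prunes it
    lora: Optional[Tuple[Matrix, Matrix]] = None  # (A, B) low-rank update; None = no adapter

    @property
    def sparsity(self) -> float:
        """Fraction of output units removed by the mask (0.0 for the intact model)."""
        if self.mask is None:
            return 0.0
        return 1.0 - sum(self.mask) / len(self.mask)


class SharedLayer:
    """One frozen weight matrix held in memory once and shared by every role."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight

    def forward(self, x: Vector, role: Role) -> Vector:
        out = matvec(self.weight, x)
        if role.lora is not None:
            a, b = role.lora
            delta = matvec(b, matvec(a, x))        # LoRA update: B (A x)
            out = [o + d for o, d in zip(out, delta)]
        if role.mask is not None:
            out = [o * m for o, m in zip(out, role.mask)]
        return out


def distill_loss(student_out: Vector, teacher_out: Vector) -> float:
    """Mean squared error, an assumed stand-in for the paper's distillation loss."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)


# Toy data: 4 output units, 2 input features (all values are illustrative).
layer = SharedLayer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
x = [1.0, 2.0]

intact = Role("intact teacher")
teacher_25 = Role("25% sparse teacher", mask=[1.0, 1.0, 1.0, 0.0])
student = Role(
    "50% sparse student",
    mask=[1.0, 0.0, 1.0, 0.0],
    lora=([[1.0, 0.0]], [[0.5], [0.0], [0.0], [0.0]]),  # rank-1 adapter
)

student_out = layer.forward(x, student)
for teacher in (intact, teacher_25):
    loss = distill_loss(student_out, layer.forward(x, teacher))
    print(f"{teacher.name} (sparsity {teacher.sparsity:.2f}): loss = {loss}")
```

With these toy numbers, the student's output is closer to the 25% sparse teacher (loss 1.0625) than to the intact teacher (loss 5.0625), which is one way to picture the capacity gap that progressive teachers are meant to narrow. Only the masks and the small LoRA matrices differ between roles; the base weights are never duplicated.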