Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, Hongsheng Li
2024-05-31
Abstract: A pivotal advancement in the progress of large language models (LLMs) is the emergence of Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer parameters, but they remain hard to deploy because of their immense parameter sizes. Unlike previous weight pruning methods that rely on specifically designed hardware, this paper aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques. Specifically, we propose, for the first time to the best of our knowledge, post-training approaches for task-agnostic and task-specific expert pruning and skipping of MoE LLMs, tailored to improve deployment efficiency while maintaining model performance across a wide range of tasks. Extensive experiments show that our proposed methods can simultaneously reduce model sizes and increase inference speed while maintaining satisfactory performance. Data and code will be available at https://github.com/Lucky-Lance/Expert_Sparsity.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to improve deployment efficiency, reduce memory usage, and speed up inference in large language models (LLMs), particularly Mixture-of-Experts (MoE) models. Although MoE models have shown performance improvements over traditional LLMs, their massive parameter count still makes deployment challenging. The paper proposes post-training methods for task-agnostic and task-specific expert pruning and for dynamic expert skipping, which improve the deployment efficiency of MoE models while maintaining model performance. Specifically, the main contributions of the paper include:

1. **Expert-level Sparsity Exploration**: Systematically studying expert-level sparsity in MoE LLMs and proposing, for the first time, hardware-friendly post-training methods for permanently removing unimportant experts (expert pruning) or dynamically skipping certain experts during inference (dynamic expert skipping).
2. **Task-agnostic and Task-specific Expert Pruning**: Proposing task-agnostic and task-specific expert pruning methods that select which experts to retain by minimizing inter-layer token reconstruction loss, thereby reducing model size while maintaining performance.
3. **Dynamic Expert Skipping**: Introducing an online method to dynamically skip certain experts during inference, further improving inference speed without significantly affecting model robustness.
4. **Experimental Validation**: Conducting extensive experiments on the Mixtral 8x7B model, showing that the proposed methods can significantly reduce memory usage and improve inference speed with minimal performance degradation.

These methods not only enhance the deployment efficiency of MoE models but also provide new insights for optimizing large language models.
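
The pruning and skipping ideas summarized above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`moe_output`, `prune_layer`, `route_with_skipping`), the exhaustive search over expert subsets, the squared-error loss, and the ratio threshold `beta` are all assumptions chosen for illustration, and the "experts" are fixed output vectors rather than real feed-forward networks.

```python
import itertools
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_output(logits, expert_outs, keep, top_k=2):
    """Toy MoE layer output for one token, routing only among experts in `keep`."""
    # Pick the top_k kept experts by router logit, then normalize their weights.
    chosen = sorted(keep, key=lambda e: logits[e], reverse=True)[:top_k]
    weights = softmax([logits[e] for e in chosen])
    dim = len(expert_outs[0])
    return [sum(w * expert_outs[e][d] for w, e in zip(weights, chosen))
            for d in range(dim)]


def prune_layer(calib, num_experts, num_keep, top_k=2):
    """Hypothetical pruning step: try every subset of `num_keep` experts and
    keep the one whose layer output is closest (squared error) to the
    unpruned layer on the calibration tokens."""
    full = range(num_experts)
    refs = [moe_output(lg, outs, full, top_k) for lg, outs in calib]
    best_keep, best_loss = None, float("inf")
    for keep in itertools.combinations(full, num_keep):
        loss = 0.0
        for (lg, outs), ref in zip(calib, refs):
            out = moe_output(lg, outs, keep, top_k)
            loss += sum((a - b) ** 2 for a, b in zip(ref, out))
        if loss < best_loss:
            best_keep, best_loss = keep, loss
    return best_keep, best_loss


def route_with_skipping(logits, beta, top_k=2):
    """Hypothetical skipping rule (assumes 0 <= beta <= 1): drop a selected
    expert when its routing weight is below beta times the top weight,
    then renormalize the surviving weights."""
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    weights = softmax([logits[e] for e in chosen])
    kept = [(e, w) for e, w in zip(chosen, weights) if w >= beta * weights[0]]
    total = sum(w for _, w in kept)
    return [(e, w / total) for e, w in kept]


# Toy calibration set: 4 experts, expert 3 is never among the top-2.
EXPERT_OUTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
CALIB = [
    ([2.0, 1.0, 0.0, -5.0], EXPERT_OUTS),
    ([0.0, 2.0, 1.0, -5.0], EXPERT_OUTS),
    ([1.0, 0.0, 2.0, -5.0], EXPERT_OUTS),
]

kept_experts, recon_loss = prune_layer(CALIB, num_experts=4, num_keep=3)
print(kept_experts, recon_loss)
print(route_with_skipping([3.0, 0.0, -1.0, -2.0], beta=0.5))
```

In this toy setup the search drops the expert that the router never selects, since removing it leaves every calibration output unchanged. Exhaustive enumeration is only practical for a small number of experts per layer (such as the 8 in Mixtral 8x7B); for larger expert counts a cheaper selection heuristic would be needed.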