Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, Hongsheng Li

2024-05-31

Abstract: A pivotal advancement in the progress of large language models (LLMs) is the emergence of Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer parameters, but they remain hard to deploy because of their immense parameter sizes. Unlike previous weight pruning methods that rely on specifically designed hardware, this paper aims to improve the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques. Specifically, we propose, for the first time to the best of our knowledge, post-training approaches for task-agnostic and task-specific expert pruning and skipping of MoE LLMs, tailored to improve deployment efficiency while maintaining model performance across a wide range of tasks. Extensive experiments show that our proposed methods can simultaneously reduce model sizes and increase inference speed while maintaining satisfactory performance. Data and code will be available at https://github.com/Lucky-Lance/Expert_Sparsity.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses how to improve deployment efficiency, reduce memory usage, and speed up inference for large language models (LLMs), particularly Mixture-of-Experts (MoE) models. Although MoE models have shown performance improvements over traditional LLMs, their massive parameter count still makes deployment challenging. The paper proposes post-training methods for task-agnostic and task-specific expert pruning and for dynamic expert skipping, which improve the deployment efficiency of MoE models while maintaining model performance. Specifically, the main contributions of the paper include:

1. **Expert-level Sparsity Exploration**: Systematically studying expert-level sparsity in MoE LLMs and proposing, for the first time, hardware-friendly post-training methods that either permanently remove unimportant experts (expert pruning) or dynamically skip certain experts during inference (dynamic expert skipping).
2. **Task-agnostic and Task-specific Expert Pruning**: Proposing task-agnostic and task-specific expert pruning methods that select which experts to retain by minimizing the layer-wise token reconstruction loss, thereby reducing model size while maintaining performance.
3. **Dynamic Expert Skipping**: Introducing an online method that dynamically skips certain experts during inference, further improving inference speed without significantly affecting model robustness.
4. **Experimental Validation**: Conducting extensive experiments on the Mixtral 8x7B model, showing that the proposed methods can significantly reduce memory usage and improve inference speed with minimal performance degradation.

These methods not only enhance the deployment efficiency of MoE models but also provide new insights for optimizing large language models.
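The pruning and skipping ideas summarized above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the toy layer works on scalar tokens, and the names `route`, `moe_output`, `prune_layer`, `moe_output_with_skipping`, and the threshold `beta` are hypothetical. The skipping rule (drop a lower-ranked expert when its routing weight falls below `beta` times the top weight, then renormalize) is an assumption made for illustration, since the summary does not state the exact criterion.

```python
import itertools
import math


def route(router_logits, kept, top_k=2):
    """Pick the top_k experts among `kept` and softmax-normalise their logits."""
    chosen = sorted(kept, key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


def moe_output(x, router_logits, experts, kept, top_k=2):
    """Toy MoE layer: weighted sum of the selected experts applied to token x."""
    chosen, weights = route(router_logits, kept, top_k)
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))


def prune_layer(tokens, experts, n_keep, top_k=2):
    """Expert pruning sketch: keep the n_keep experts whose layer output is
    closest (squared error) to the full layer on calibration tokens.

    `tokens` is a list of (x, router_logits) pairs. Every subset is enumerated,
    which is only feasible when a layer has few experts.
    """
    all_experts = range(len(experts))
    reference = [moe_output(x, lg, experts, all_experts, top_k) for x, lg in tokens]
    best_subset, best_loss = None, math.inf
    for subset in itertools.combinations(all_experts, n_keep):
        loss = sum(
            (moe_output(x, lg, experts, subset, top_k) - ref) ** 2
            for (x, lg), ref in zip(tokens, reference)
        )
        if loss < best_loss:
            best_subset, best_loss = subset, loss
    return best_subset, best_loss


def moe_output_with_skipping(x, router_logits, experts, kept, beta, top_k=2):
    """Dynamic skipping sketch (assumed rule): always run the top-1 expert and
    skip any other selected expert whose weight is below beta * top-1 weight.
    Returns the layer output and the number of experts actually evaluated."""
    chosen, weights = route(router_logits, kept, top_k)
    active = [(weights[0], chosen[0])]
    active += [(w, i) for w, i in zip(weights[1:], chosen[1:]) if w >= beta * weights[0]]
    total = sum(w for w, _ in active)  # renormalise over the experts that run
    out = sum(w / total * experts[i](x) for w, i in active)
    return out, len(active)
```

For a calibration set in which one expert never reaches the top-2, `prune_layer` with `n_keep` one below the expert count removes that expert with zero reconstruction loss. Exhaustive enumeration is practical for layers with a handful of experts, such as the 8 per layer in Mixtral 8x7B, but grows combinatorially with the expert count.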

Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts

Toward Inference-optimal Mixture-of-Expert Large Language Models

LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

Mixture of Diverse Size Experts

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Task-Specific Expert Pruning for Sparse Mixture-of-Experts

Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

A Closer Look into Mixture-of-Experts in Large Language Models

MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark

Merging Experts into One: Improving Computational Efficiency of Mixture of Experts

Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning