Abstract: Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not relieve the massive memory requirements of these networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores removing entire MoE layers to reduce memory, the performance degradation is still notable. In this paper, we propose Condense-MoE (CD-MoE), which, instead of dropping the entire MoE layer, condenses the big, sparse MoE layer into a small but dense layer with only a few experts that are activated for all tokens. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with certain experts isolated to serve as shared experts that are always activated. We demonstrate the effectiveness of our method across multiple MoE models, such as DeepSeekMoE and QwenMoE, on various benchmarks. Specifically, for the DeepSeekMoE-16B model, our approach maintains nearly 90% of the average accuracy while reducing memory usage by 30% and enhancing inference speed by 30%. Moreover, we show that with lightweight expert fine-tuning, the pruned model can achieve further improvements on specific tasks. Our code is available at <a class="link-external link-https" href="https://github.com/duterscmy/CD-MoE/tree/main" rel="external noopener nofollow">this https URL</a>.
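To make the condensing idea in the abstract concrete, here is a minimal NumPy sketch that contrasts a sparse MoE layer (shared experts plus per-token top-k routing) with a condensed layer (shared experts plus a fixed handful of routed experts applied to every token, with no router). It is an illustration under stated assumptions, not the authors' implementation: the helper names (`expert`, `sparse_moe`, `condensed_layer`, `n_params`), the ReLU experts, and the use of constant weights for the kept experts are all hypothetical choices made for this example.

```python
import numpy as np

def expert(x, w_in, w_out):
    """A tiny feed-forward expert: ReLU(x @ w_in) @ w_out."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def sparse_moe(x, shared, routed, router_w, top_k):
    """Shared experts see every token; each token also visits its top_k routed experts."""
    out = np.zeros_like(x)
    for params in shared:
        out += expert(x, *params)
    logits = x @ router_w                                  # (tokens, n_routed)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)              # softmax gate values
    top = np.argsort(-probs, axis=1)[:, :top_k]            # per-token expert choice
    for t in range(x.shape[0]):
        for e in top[t]:
            out[t] += probs[t, e] * expert(x[t:t + 1], *routed[e])[0]
    return out

def condensed_layer(x, shared, routed, kept, weights):
    """Assumed condensed form: no router, the kept experts process all tokens."""
    out = np.zeros_like(x)
    for params in shared:
        out += expert(x, *params)
    for e, w in zip(kept, weights):
        out += w * expert(x, *routed[e])                   # constant weight (assumption)
    return out

def n_params(experts):
    """Number of stored weights in a list of (w_in, w_out) experts."""
    return sum(w_in.size + w_out.size for w_in, w_out in experts)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_hidden, n_routed = 8, 4, 6

def make_expert():
    return (rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model)))

shared = [make_expert()]
routed = [make_expert() for _ in range(n_routed)]
router_w = rng.normal(size=(d_model, n_routed))
x = rng.normal(size=(5, d_model))

y_sparse = sparse_moe(x, shared, routed, router_w, top_k=2)
y_dense = condensed_layer(x, shared, routed, kept=[0, 3], weights=[0.5, 0.5])

# The condensed layer only needs to store the shared and kept experts.
full_size = n_params(shared) + n_params(routed) + router_w.size
small_size = n_params(shared) + n_params([routed[e] for e in [0, 3]])
print(full_size, small_size)
```

The memory saving comes from discarding the router and every routed expert that is not kept; the dense forward pass also avoids per-token dispatch, which is one plausible source of the reported speedup.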

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the excessive memory footprint of Mixture-of-Experts (MoE) models in practical applications, especially large language models (LLMs). Although the MoE architecture expands model capacity by selectively activating expert paths without increasing the number of active parameters, all experts must still be stored in memory. This limits real-world deployment, especially for large language models.

Existing research attempts to reduce memory requirements by deleting entire MoE layers, but this leads to a notable performance degradation. The paper therefore proposes a new method, Condense-MoE (CD-MoE). Instead of deleting the entire MoE layer, CD-MoE condenses a large, sparse MoE layer into a small, dense layer that retains only a few experts, which are active for all tokens. The method is designed for fine-grained MoE models, in which the feed-forward network is split into many small experts and some experts are isolated as shared experts that are always active.

### Main contributions

1. **The CD-MoE framework**: By eliminating unimportant experts and reassigning all tokens to the few remaining experts, CD-MoE condenses a large, sparse MoE layer into a small, dense layer, improving inference efficiency.
2. **Experimental verification**: Extensive experiments are carried out on multiple MoE models (such as DeepSeekMoE and QwenMoE). For DeepSeekMoE-16B, CD-MoE reduces memory usage by 30% and increases inference speed by 30% while retaining nearly 90% of the original average accuracy.
3. **Lightweight fine-tuning**: Lightweight expert fine-tuning further improves the performance of the compressed model on specific tasks.

### Method overview

1. **Expert selection and compression**:
   - Remove the routing mechanism and calculate the gate value of each expert.
   - Select the most critical experts with a greedy search so that the output of the compressed layer is as close as possible to the output of the original layer.
2. **Layer selection and compression**:
   - Use Jensen-Shannon (JS) divergence to measure how much the output changes when a given layer is compressed.
   - Adopt a greedy search strategy to select, for compression, the layers with the least impact on the model output.
3. **Experimental setup and evaluation**:
   - Use the DeepSeekMoE-16B model for experiments.
   - Evaluate zero-shot accuracy, inference speedup, and memory usage.
   - Conduct lightweight fine-tuning to further improve model performance.

### Experimental results

- **Zero-shot tasks**: On multiple benchmark datasets, CD-MoE consistently outperforms the Block Trimming and Layer Trimming methods at similar speedup or memory usage.
- **Fine-tuning results**: With language modeling and supervised fine-tuning, the CD-MoE (E2+6) configuration shows significant advantages on multiple tasks.

### Conclusion

CD-MoE reduces the memory footprint of MoE models in practical applications while retaining most of the original model's accuracy. By keeping only the most important experts and choosing which layers to condense, the method lowers memory usage and improves inference efficiency.
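The layer-selection step in the method overview above (score candidates by Jensen-Shannon divergence, then pick greedily) can be sketched as follows. This is a toy under explicit assumptions, not the paper's code: `js_divergence` and `greedy_select` are hypothetical helper names, and the `damage` table with its `cost` function stands in for what would, in practice, require running the original and compressed models on calibration data and comparing their output distributions.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                       # mixture distribution

    def kl(a, b):
        return float(np.sum(a * np.log((a + eps) / (b + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def greedy_select(candidates, budget, cost):
    """Grow a set greedily: at each step add the candidate with the lowest cost(selected + [c])."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        best = min(remaining, key=lambda c: cost(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stand-in for a model: compressing layer l shifts the output logits
# along a fixed direction by damage[l]. All numbers here are made up.
base_logits = np.array([2.0, 1.0, 0.5, 0.1])
direction = np.array([-1.0, 1.0, 0.0, 0.0])
damage = {0: 0.9, 1: 0.05, 2: 0.4, 3: 0.01}

def cost(layers):
    """JS divergence between the original output and the output with `layers` compressed."""
    shift = sum(damage[l] for l in layers)
    return js_divergence(softmax(base_logits), softmax(base_logits + shift * direction))

chosen = greedy_select(damage.keys(), budget=2, cost=cost)
print(chosen)
```

The same `greedy_select` loop could be reused for choosing which experts to keep inside a layer by swapping in a cost that measures the gap between the compressed layer's output and the original layer's output. Greedy search evaluates far fewer combinations than an exhaustive search, at the price of not guaranteeing the best overall set.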

Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts

Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy

Demystifying the Compression of Mixture-of-Experts Through a Unified Framework

SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training

MoEC: Mixture of Expert Clusters

SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models Via Dynamic Expert Pruning and Swapping

Task-Specific Expert Pruning for Sparse Mixture-of-Experts

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget

STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning

XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

Merging Experts into One: Improving Computational Efficiency of Mixture of Experts

Llama 3 Meets MoE: Efficient Upcycling

Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design