Abstract: Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models; however, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the currently activated experts are redundant, as many tokens may require only a single expert. Motivated by these issues, we investigate MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors in activation reconstruction error, routing scores, and activation frequencies, highlighting their differing importance, and b) not all tokens are equally important; only a small subset is critical. Building on these insights, we propose MC-MoE, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization, which formulates adaptive bit-width allocation as a Linear Programming problem whose objective function balances multiple factors reflecting the importance of each expert. Additionally, we develop Online Dynamic Pruning, which identifies important tokens to retain and dynamically selects activated experts for the other tokens during inference to optimize efficiency while maintaining performance. MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss, ensuring an optimal trade-off between performance and efficiency. Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
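
To make the bit-width allocation idea concrete, the following is a minimal sketch and not the MC-MoE implementation. It assumes each expert has a single scalar importance score (standing in for the combination of reconstruction error, routing score, and activation frequency mentioned above), uses a hypothetical error proxy that halves with each extra bit, and replaces the Linear Programming solver with exhaustive search, which is only feasible for a handful of experts. The function name `allocate_bits`, the candidate bit-widths, and the budget are all illustrative assumptions.

```python
from itertools import product


def allocate_bits(importance, choices=(1, 2, 3), avg_bits=2.0):
    """Pick one bit-width per expert under an average bit-width budget.

    Exhaustive search over all assignments; a stand-in for an LP solver
    that is only practical for very small numbers of experts.
    """
    n = len(importance)
    budget = avg_bits * n  # total bits allowed across all experts
    best, best_cost = None, float("inf")
    for assign in product(choices, repeat=n):
        if sum(assign) > budget:
            continue  # violates the average bit-width constraint
        # Hypothetical cost: importance-weighted error proxy that
        # halves with every additional bit (an assumption, not from the paper).
        cost = sum(w * 2.0 ** (-b) for w, b in zip(importance, assign))
        if cost < best_cost:
            best, best_cost = assign, cost
    return best


# Four experts with assumed importance scores; target average of 2 bits.
importance = [0.9, 0.1, 0.5, 0.5]
bits = allocate_bits(importance)
print(bits)
```

Under these assumptions the most important expert receives the widest bit-width and the least important one the narrowest, while the average stays within the budget. A real allocation over many experts and layers would need an actual LP or integer-programming solver rather than enumeration.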

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More