Abstract: Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models. However, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the currently activated experts are redundant, as many tokens may require only a single expert. Motivated by these issues, we investigate MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors in activation reconstruction error, routing scores, and activation frequencies, highlighting their differing importance; and b) not all tokens are equally important, and only a small subset is critical. Building on these insights, we propose MC-MoE, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization, which formulates adaptive bit-width allocation as a Linear Programming problem whose objective function balances multiple factors reflecting the importance of each expert. Second, we develop Online Dynamic Pruning, which identifies important tokens to retain and dynamically selects the activated experts for the remaining tokens during inference, optimizing efficiency while maintaining performance. MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with little accuracy loss, ensuring an optimal trade-off between performance and efficiency. Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
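The two components described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the importance weights, per-bit error values, the bit-width options, the routing-weight ratio rule, and the function names `allocate_bits` and `prune_experts` are all hypothetical, and a brute-force search over a handful of experts stands in for the Linear Programming solver the abstract refers to.

```python
from itertools import product


def allocate_bits(importance, error, bit_options=(1, 2, 3), avg_bits=2.0):
    """Pick one bit-width per expert under an average-bit budget.

    importance[i]: hypothetical importance weight of expert i (e.g. derived
                   from routing scores and activation frequency).
    error[i][b]:   hypothetical reconstruction error of expert i at b bits.
    Brute force is used here in place of an LP solver (toy sizes only).
    """
    n = len(importance)
    budget = avg_bits * n
    best_plan, best_cost = None, float("inf")
    for plan in product(bit_options, repeat=n):
        if sum(plan) > budget:
            continue  # violates the average bit-width constraint
        cost = sum(importance[i] * error[i][b] for i, b in enumerate(plan))
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


def prune_experts(top2_weights, token_is_important, ratio_threshold=0.3):
    """Decide how many of the top-2 experts each token keeps (1 or 2).

    Assumed rule: important tokens always keep both experts; other tokens
    drop the second expert when its routing weight is small relative to
    the first. The threshold value is illustrative.
    """
    kept = []
    for (w1, w2), important in zip(top2_weights, token_is_important):
        keep_both = important or (w2 / w1) >= ratio_threshold
        kept.append(2 if keep_both else 1)
    return kept


# Toy data: four experts sharing the same (made-up) error profile.
importance = [0.5, 0.3, 0.15, 0.05]
error = [{1: 4.0, 2: 1.0, 3: 0.25} for _ in importance]
plan, cost = allocate_bits(importance, error)

# Toy routing weights for three tokens; only the last is marked important.
kept = prune_experts(
    [(0.9, 0.1), (0.6, 0.4), (0.95, 0.05)],
    [False, False, True],
)
print(plan, round(cost, 3), kept)
```

In this toy setting the most important expert receives the widest bit-width and the least important the narrowest, while the average stays within the budget. For realistic expert counts, exhaustive search is infeasible and an LP or integer-programming solver would be the natural replacement.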

Q-MoE: Connector for MLLMs with Text-Driven Routing

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

Dense Connector for MLLMs

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE

Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free

LocMoE: A Low-Overhead MoE for Large Language Model Training

Routing Experts: Learning to Route Dynamic Experts in Multi-modal Large Language Models

MoExtend: Tuning New Experts for Modality and Task Extension

CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach

WDMoE: Wireless Distributed Large Language Models with Mixture of Experts

HMoE: Heterogeneous Mixture of Experts for Language Modeling

To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models

Multi-modal Intent Detection with LVAMoE: the Language-Visual-Audio Mixture of Experts

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration