Abstract: Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs), offering unprecedented computational efficiency. However, these architectures grapple with token distribution imbalance and expert homogenization, which impede optimal semantic generalization. We introduce a novel framework that redefines MoE routing through affinity-driven active selection. The framework's innovations encompass: (1) a rigorous formulation of expert-token affinity metrics; (2) an adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; and (3) a theoretical derivation and experimental evidence of reduced expert capacity bounds under dynamic token distribution evolution. The framework is also integrated with an orthogonal feature extraction module and an optimized loss function for expert localization. Our theoretical analysis demonstrates that this approach mitigates expert homogenization while enabling substantial capacity boundary reduction. Experimental validation corroborates these findings: the framework achieves a 40% reduction in the tokens processed by each expert without compromising model convergence or efficacy. When coupled with communication optimizations, training efficiency improves by 5.4% to 46.6%. After supervised fine-tuning, it exhibits performance gains of 9.7% to 14.1% across the GDAD, C-Eval, and TeleQnA benchmarks.

What problem does this paper attempt to address?

The paper attempts to address two main challenges in Mixture-of-Experts (MoE) architectures for large language models (LLMs): **uneven token distribution** and **expert homogenization**. Both hinder the model's semantic generalization. Specifically:

1. **Uneven token distribution**: In traditional MoE models, some experts become overloaded because they are assigned too many tokens, while other experts do not receive sufficient training. This "winner-takes-all" phenomenon can degrade model performance.
2. **Expert homogenization**: Because of the limitations of token allocation strategies, different experts may learn similar features, which reduces the model's diversity and generalization ability.

To address these issues, the paper proposes a new framework, **Expert-Token Resonance**, built on an affinity-driven active selection mechanism. Its main innovations are:

1. **Rigorous expert-token affinity metric**: The affinity score is defined as the cosine similarity between tokens and experts. It guides the router to focus each expert on different types of tokens, reducing the tendency towards homogenization.
2. **Bidirectional expert-token selection mechanism**: By combining the Expert Choice Router (ECR) and the Token Choice Router (TCR), each expert can select the most suitable tokens to process based on the affinity score, thereby increasing the success rate of training.
3. **Adaptive expert capacity boundary**: Setting an adaptive affinity threshold significantly lowers the lower bound on expert capacity. As training proceeds, the information density of token features gradually increases, so the expert capacity first decreases and then stabilizes, which greatly improves MoE training efficiency.

According to the reported experiments, after supervised fine-tuning the method improves performance by 9.7% to 14.1% on the GDAD, C-Eval, and TeleQnA benchmarks. It also improves training efficiency on Ascend clusters of different scales, with a maximum improvement of 46.6%.
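The routing idea described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the use of one representative vector per expert, the fixed `threshold`, and the `capacity` cap are all assumptions made for illustration. Tokens first propose their highest-affinity expert (token choice), then each expert keeps only the proposals whose cosine affinity clears the threshold, up to its capacity (expert choice).

```python
import numpy as np


def cosine_affinity(tokens, expert_vecs, eps=1e-8):
    """Return affinity[i, j]: cosine similarity of token i and expert j.

    Assumption: each expert is represented by a single vector
    (for example, a row of the router weight matrix).
    """
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + eps)
    e = expert_vecs / (np.linalg.norm(expert_vecs, axis=1, keepdims=True) + eps)
    return t @ e.T


def bidirectional_select(affinity, threshold, capacity):
    """Hypothetical two-sided selection.

    Token side: every token proposes its highest-affinity expert.
    Expert side: every expert keeps proposals with affinity >= threshold,
    highest affinity first, and at most `capacity` of them.
    """
    n_tokens, n_experts = affinity.shape
    proposed = affinity.argmax(axis=1)
    assignment = {}
    for j in range(n_experts):
        candidates = np.flatnonzero(proposed == j)
        candidates = candidates[affinity[candidates, j] >= threshold]
        order = np.argsort(-affinity[candidates, j], kind="stable")
        assignment[j] = candidates[order][:capacity].tolist()
    return assignment


# Toy data: five 2-D tokens and two experts (illustrative values only).
tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 1.0], [1.0, 0.9]])
experts = np.array([[1.0, 0.0], [0.0, 1.0]])

affinity = cosine_affinity(tokens, experts)
routing = bidirectional_select(affinity, threshold=0.8, capacity=2)
print(routing)
```

In this toy setup the last token proposes expert 0 but its affinity is below the threshold, so no expert processes it. That is the sense in which an affinity threshold can reduce the number of tokens each expert handles; how the paper adapts the threshold during training is not shown here.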

Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection
