Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection

Jing Li, Zhijie Sun, Dachao Lin, Xuan He, Yi Lin, Binfan Zheng, Li Zeng, Rongqian Zhao, Xin Chen
2024-08-30
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs), offering unprecedented computational efficiency. However, these architectures grapple with challenges of token distribution imbalance and expert homogenization, impeding optimal semantic generalization. We introduce a novel framework that redefines MoE routing through affinity-driven active selection. The framework's innovations are: (1) A rigorous formulation of expert-token affinity metrics. (2) An adaptive bidirectional selection mechanism leveraging resonance between experts and tokens. (3) Theoretical derivation and experimental evidence of reduced expert capacity bounds under dynamic token distribution evolution. It is also integrated with an orthogonal feature extraction module and an optimized loss function for expert localization. Our theoretical analysis demonstrates that this approach mitigates expert homogenization while enabling substantial capacity boundary reduction. Experimental validation corroborates these findings: it achieves a 40% reduction in tokens processed by each expert without compromising model convergence or efficacy. When coupled with communication optimizations, training efficiency improvements of 5.4% to 46.6% can be observed. After supervised fine-tuning, it exhibits performance gains of 9.7% to 14.1% across GDAD, C-Eval, and TeleQnA benchmarks.
Computation and Language
What problem does this paper attempt to address?
The paper addresses two main challenges in Mixture-of-Experts (MoE) architectures for large language models (LLMs): **uneven token distribution** and **expert homogenization**. Both limit how well the model generalizes semantically. Specifically:

1. **Uneven token distribution**: In traditional MoE models, some experts become overloaded because they are assigned too many tokens, while other experts do not receive enough tokens to train adequately. This "winner-takes-all" phenomenon can degrade model performance.
2. **Expert homogenization**: Because of the limitations of token allocation strategies, different experts may learn similar features, which reduces the model's diversity and generalization ability.

To address these issues, the paper proposes a new framework, **Expert-Token Resonance**, built on an affinity-driven active selection mechanism. Its main innovations are:

1. **Rigorous expert-token affinity metric**: The affinity score is defined as the cosine similarity between tokens and experts. It guides the router to focus on different types of tokens and so reduces the tendency towards homogenization.
2. **Bidirectional expert-token selection mechanism**: By combining the Expert Choice Router (ECR) and the Token Choice Router (TCR), each expert can select the tokens best suited to it based on the affinity score, thereby increasing the success rate of training.
3. **Adaptive expert capacity bound**: Setting an adaptive affinity threshold significantly reduces the lower bound on expert capacity. As training proceeds, the information density of token features gradually increases, so the expert capacity first decreases and then stabilizes, which ultimately improves the training efficiency of MoE considerably.
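The affinity score and the two-sided selection described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`cosine`, `affinity_matrix`, `bidirectional_route`), the representation of each expert as a single vector, the fixed `threshold` (the paper describes an adaptive one), and the optional `capacity` cap are all assumptions made for illustration. Tokens first nominate their highest-affinity experts (token choice), then each expert keeps only the nominated tokens whose affinity clears the threshold (expert choice).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def affinity_matrix(tokens, experts):
    """aff[t][e] = cosine similarity between token t and expert vector e."""
    return [[cosine(t, e) for e in experts] for t in tokens]


def bidirectional_route(tokens, experts, top_k=1, threshold=0.5, capacity=None):
    """Hypothetical two-stage routing: token choice, then expert choice.

    Returns a dict mapping expert index -> list of accepted token indices,
    ordered by descending affinity.
    """
    aff = affinity_matrix(tokens, experts)
    n_experts = len(experts)

    # Stage 1 (token choice): each token nominates its top_k experts.
    candidates = {e: [] for e in range(n_experts)}
    for t, row in enumerate(aff):
        ranked = sorted(range(n_experts), key=lambda e: row[e], reverse=True)
        for e in ranked[:top_k]:
            candidates[e].append(t)

    # Stage 2 (expert choice): each expert keeps nominated tokens whose
    # affinity reaches the threshold, optionally capped at `capacity`.
    assignment = {}
    for e, nominated in candidates.items():
        kept = [t for t in nominated if aff[t][e] >= threshold]
        kept.sort(key=lambda t: aff[t][e], reverse=True)
        if capacity is not None:
            kept = kept[:capacity]
        assignment[e] = kept
    return assignment


# Toy data (assumed): four 2-D token features and two expert vectors.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.6, 0.5]]
experts = [[1.0, 0.0], [0.0, 1.0]]
routed = bidirectional_route(tokens, experts, top_k=1, threshold=0.8)
capped = bidirectional_route(tokens, experts, top_k=1, threshold=0.8, capacity=1)
```

In this toy run the fourth token prefers expert 0 but its affinity falls below the threshold, so no expert processes it. Raising the threshold therefore lowers the number of tokens each expert handles, which is the intuition behind the reduced capacity bound, although the paper's actual threshold schedule is not reproduced here.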
Experimental results show performance gains of 9.7% to 14.1% on the GDAD, C-Eval, and TeleQnA benchmarks after supervised fine-tuning. The method also improves training efficiency on Ascend clusters of different scales, by up to 46.6%.