AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models

Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng
2024-10-14
Abstract: Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) due to its promise to boost model capacity without causing significant overheads. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because different tokens (e.g., "<EOS>" vs. "apple") may require different numbers of experts for feature abstraction. Lifting such a constraint can help make the most of limited resources and unleash the potential of the model for downstream tasks. In this sense, we introduce AdaMoE to realize token-adaptive routing for MoE, where different tokens are permitted to select different numbers of experts. AdaMoE makes minimal modifications to the vanilla MoE with top-k routing -- it simply introduces a fixed number of null experts, which do not consume any FLOPs, to the expert set and increases the value of k. AdaMoE does not force each token to occupy a fixed number of null experts but ensures the average usage of the null experts with a load-balancing loss, leading to an adaptive number of null/true experts used by each token. AdaMoE exhibits a strong resemblance to MoEs with expert choice routing while allowing for trivial auto-regressive modeling. AdaMoE is easy to implement and can be effectively applied to pre-trained (MoE-)LLMs. Extensive studies show that AdaMoE can reduce average expert load (FLOPs) while achieving superior performance. For example, on the ARC-C dataset, applying our method to fine-tuning Mixtral-8x7B can reduce FLOPs by 14.5% while increasing accuracy by 1.69%.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a limitation of existing Mixture of Experts (MoE) methods: they route every token to a fixed number of experts (constant top-k), even though different tokens may need different numbers of experts for feature abstraction. A fixed k therefore spends the same compute on every token, which limits how well resources are used and may keep the model from reaching its full potential. To lift this constraint, the authors propose AdaMoE, which adds null experts to the expert set and increases k. Because a null expert consumes no FLOPs, a token that selects more null experts uses fewer true experts, so the number of true experts adapts per token.

### Main Contributions:

1. **Token-Adaptive Routing**: AdaMoE introduces null experts, allowing each token to select a different number of true experts and thus allocating compute more flexibly.
2. **Improved Computational Efficiency**: Experimental results show that AdaMoE can reduce average expert load (FLOPs) while improving model accuracy.
3. **Ease of Implementation**: AdaMoE makes minimal modifications to vanilla top-k MoE routing and can be applied to pre-trained LLMs and MoE-LLMs.

### Experimental Results:

- **Performance on Multiple Datasets**: AdaMoE demonstrates higher accuracy on datasets such as RTE, COLA, SQA, CQA, and OQA.
- **Optimization of Computational Resources**: When fine-tuning Mixtral-8x7B on the ARC-C dataset, AdaMoE reduces FLOPs by 14.5% while improving accuracy by 1.69%.
- **Load Balancing**: A load-balancing loss controls the average usage of the null experts rather than forcing each token to occupy a fixed number of them, which keeps per-token expert selection flexible.

In summary, AdaMoE's token-adaptive routing improves model performance while reducing computational cost (a 14.5% FLOPs reduction in the Mixtral-8x7B example), offering a simple way to make MoE language models more efficient.
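
The null-expert idea described above can be illustrated with a minimal, single-token sketch in plain Python. This is not the authors' implementation: the function name `route_with_null_experts`, the parameter `num_true`, and the choice to renormalise weights over the selected true experts only are assumptions made for illustration, and the load-balancing loss is omitted.

```python
import math


def route_with_null_experts(logits, num_true, k):
    """Top-k routing for one token over true + null experts (illustrative only).

    logits:   router scores; the first `num_true` entries belong to true
              experts, the remaining entries to null experts.
    Returns (indices of selected true experts, their weights,
             number of top-k slots taken by null experts).
    """
    # Numerically stable softmax over all experts, true and null together.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Select the k highest-scoring experts regardless of their type.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Null experts do no computation, so only true experts cost FLOPs.
    true_selected = [i for i in top if i < num_true]
    null_slots = k - len(true_selected)

    # Assumption: weights are renormalised over the selected true experts.
    mass = sum(probs[i] for i in true_selected)
    weights = [probs[i] / mass for i in true_selected] if true_selected else []
    return true_selected, weights, null_slots


# Hypothetical setup: 4 true experts, 2 null experts, k = 3.
easy_token = [2.0, 1.0, 0.1, 0.0, 3.0, 2.5]    # null experts score highest
hard_token = [2.0, 1.5, 1.0, 0.0, -1.0, -2.0]  # true experts score highest

print(route_with_null_experts(easy_token, num_true=4, k=3))  # 1 true expert used
print(route_with_null_experts(hard_token, num_true=4, k=3))  # 3 true experts used
```

With the same k, the first token occupies two null slots and runs a single true expert, while the second runs three. This is the sense in which a constant top-k over an enlarged expert set yields a per-token adaptive number of true experts.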