Yuan 2.0-M32: Mixture of Experts with Attention Router

Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu Qiao, Houbo He, Zeru Zhang, Zeyu Sun, Junxiong Mao, Chong Shen

2024-05-29

Abstract: Yuan 2.0-M32, with a base architecture similar to that of Yuan-2.0 2B, uses a mixture-of-experts architecture with 32 experts, of which 2 are active. A new router network, Attention Router, is proposed and adopted for more efficient selection of experts; it improves accuracy compared to a model with a classical router network. Yuan 2.0-M32 is trained from scratch on 2000B tokens, and its training computation is only 9.25% of that of a dense model at the same parameter scale. Yuan 2.0-M32 demonstrates competitive capability in coding, math, and various domains of expertise, with only 3.7B active parameters out of 40B in total and 7.4 GFlops of forward computation per token, both of which are only 1/19 of Llama3-70B. Yuan 2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, with accuracies of 55.89 and 95.8 respectively. The models and source code of Yuan 2.0-M32 are released on GitHub.

Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper introduces Yuan 2.0-M32, a large language model based on Yuan 2.0 with a Mixture of Experts (MoE) architecture. Unlike traditional MoE structures, Yuan 2.0-M32 uses a new router network called Attention Router, which takes the relevance between experts into account and improves the model's accuracy. Training consumes only 9.25% of the computation of a dense model at the same parameter scale. Of the model's 40 billion total parameters, only 3.7 billion are active, and forward computation is 7.4 GFlops per token; both figures are approximately 1/19 of Llama3-70B. In multi-task testing, Yuan 2.0-M32 performs well, achieving accuracies of 55.89 and 95.8 on the MATH and ARC-Challenge benchmarks, respectively, surpassing Llama3-70B. The paper also compares different router structures, demonstrating the superiority of the attention router in terms of accuracy, and discusses other MoE approaches such as GShard, Switch Transformer, and the Expert Choice algorithm. Furthermore, the paper provides detailed descriptions of the model architecture, training strategy, datasets, and preprocessing methods. Yuan 2.0-M32 delivers competitive accuracy at low computational cost on code generation, mathematical problem solving, the comprehensive MMLU benchmark, and the ARC scientific reasoning tasks. Despite using far less computation and far fewer active parameters, it achieves results comparable or even superior to those of larger-scale models. The model and source code have been open-sourced for the research community to use.
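The routing idea described above, scoring 32 experts with a router that considers how experts relate to one another and then activating 2 of them, can be illustrated with a minimal sketch. The summary does not give the router's formula, so everything here is an assumption made for illustration: the router projects the token into per-expert query, key, and value scores, mixes the value scores through a softmax over expert-to-expert products, and renormalizes the gates of the selected experts. The names `attention_router` and `make_weights`, the toy hidden size, and the random weights are all hypothetical and are not taken from the released code.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_router(x, Wq, Wk, Wv, top_k=2):
    """Assumed attention-style router: one score per expert, then top-k.

    q, k, v each hold one scalar per expert. Expert i's final score is a
    softmax-weighted mix of all experts' v values, so the score of one
    expert depends on the others (the "relevance between experts").
    """
    q = matvec(Wq, x)
    k = matvec(Wk, x)
    v = matvec(Wv, x)
    scores = []
    for qi in q:
        attn = softmax([qi * kj for kj in k])  # expert-to-expert weights
        scores.append(sum(a * vj for a, vj in zip(attn, v)))
    # Keep the top_k highest-scoring experts.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: gates are renormalized over the chosen experts only.
    gates = softmax([scores[i] for i in chosen])
    return chosen, gates

def make_weights(n_experts, d_model, rng):
    """Hypothetical random router weights, one row per expert."""
    return [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(n_experts)]

rng = random.Random(0)
N_EXPERTS, D_MODEL = 32, 16  # 32 experts as in the paper; D_MODEL is a toy size
Wq, Wk, Wv = (make_weights(N_EXPERTS, D_MODEL, rng) for _ in range(3))
token = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]

experts, gates = attention_router(token, Wq, Wk, Wv)
print("selected experts:", experts)
print("gate weights:", gates)
```

In a full MoE layer, the token would be sent only to the two selected experts and their outputs combined using the gate weights, which is why only a fraction of the total parameters is active for each token.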
