Abstract: Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient as the improvement in performance diminishes with an increasing number of experts. We hypothesize this parameter inefficiency is a result of all experts having equal capacity, which may not adequately meet the varying complexity requirements of different tokens or tasks. In light of this, we propose Stratified Mixture of Experts (SMoE) models, which feature a stratified structure and can assign dynamic capacity to different tokens. We demonstrate the effectiveness of SMoE on three multilingual machine translation benchmarks, containing 4, 15, and 94 language pairs, respectively. We show that SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the parameter efficiency issue in Mixture-of-Experts (MoE) models. Although existing MoE models significantly increase the number of model parameters through sparse activation mechanisms while maintaining low per-token computational costs, their performance gains diminish as the number of experts increases, demonstrating poor parameter efficiency. Specifically, all experts have the same capacity allocation, which may not meet the complexity needs of different tokens or tasks.

To solve this problem, the authors propose the Stratified Mixture of Experts (SMoE) model, which dynamically allocates capacity to different tokens through a hierarchical structure. In this way, SMoE can utilize parameters more efficiently, achieving better performance in tasks such as multilingual machine translation.

### Main Contributions

1. **Introduction of Dynamic Capacity Concept**: A hierarchical mixture of experts model (SMoE) is proposed, which can automatically allocate dynamic capacity to different input tokens, making the expert model more parameter-efficient.
2. **Superior Performance in Multilingual Machine Translation Tasks**: In multilingual machine translation tasks, the SMoE model significantly outperforms several strong baseline models with the same or fewer parameters. For example, SMoE can achieve performance comparable to traditional MoE models with only half the parameters.
3. **In-depth Analysis of Factors Affecting Dynamic Capacity Allocation**: The paper explores factors affecting dynamic capacity allocation, including the language of the tokens and the position of the SMoE blocks in the model architecture.

### Experimental Results

- **M4 Dataset**: On the M4 dataset containing 4 languages, the SMoE-2-2-2-2 configuration achieves an average BLEU score 0.9 higher than the Switch Transformer and 0.74 higher than traditional MoE models.
- **M15 Dataset**: On the M15 dataset containing 15 languages, the best SMoE configuration (SMoE-4-12) achieves an average BLEU score 1.04 higher than the Switch Transformer and 0.93 higher than traditional MoE models.
- **OPUS-100 Dataset**: On the OPUS-100 dataset containing 94 languages, the SMoE-4-12 configuration achieves an average BLEU score 1.01 higher than traditional MoE models and 1.63 higher than the Switch Transformer.

### Analysis

- **Impact of Token Language**: In the decoder, the target language significantly affects the Request Capacity (RC) of tokens, while the source language has a smaller impact. In the encoder, RC is influenced by both the source and target languages.
- **Impact of Token Frequency**: High-frequency tokens generally have lower RC because the model is overly exposed to these frequently occurring tokens and thus does not need much capacity to process them. Conversely, low-frequency tokens may require higher RC because they are more complex or rare.

In summary, this paper significantly improves the parameter efficiency of Mixture-of-Experts models by introducing a hierarchical structure and dynamic capacity allocation mechanism, achieving excellent performance in multilingual machine translation tasks.
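To make the idea of per-token dynamic capacity concrete, here is a toy Python sketch. It is not the paper's routing algorithm: the function `strata_visited`, the threshold rule, and all scores are hypothetical and chosen only for illustration. It assumes each token is always processed by the first stratum and moves on to the next one only while a made-up "continue" score stays above a threshold, so different tokens end up using different amounts of capacity.

```python
from typing import Dict, List


def strata_visited(continue_scores: List[float], threshold: float = 0.5) -> int:
    """Count how many strata a token passes through (hypothetical rule).

    continue_scores[i] is an invented router score for moving on after
    stratum i. This threshold rule is an assumption for illustration,
    not the routing mechanism described in the paper.
    """
    visited = 1  # every token is processed by at least one stratum
    for score in continue_scores:
        if score <= threshold:
            break  # the token stops here and uses no further capacity
        visited += 1
    return visited


# Made-up scores for a model with four strata (three "continue" decisions).
tokens: Dict[str, List[float]] = {
    "the": [0.1, 0.9, 0.9],          # frequent token: stops after the first stratum
    "translation": [0.8, 0.3, 0.9],  # stops after the second stratum
    "Zugzwang": [0.9, 0.8, 0.7],     # rare token: visits all four strata
}

capacity = {tok: strata_visited(scores) for tok, scores in tokens.items()}
print(capacity)  # {'the': 1, 'translation': 2, 'Zugzwang': 4}
```

In a flat MoE layer every token would receive the same capacity; in this toy setup the frequent token uses one stratum while the rare token uses four, which mirrors the token-frequency trend reported in the analysis above.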

Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models

HMoE: Heterogeneous Mixture of Experts for Language Modeling

Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

Multi-Head Mixture-of-Experts

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Mixture of A Million Experts

From Sparse to Soft Mixtures of Experts

Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks

Taming Sparsely Activated Transformer with Stochastic Experts

MoEfication: Conditional Computation of Transformer Models for Efficient Inference

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

Mixture of Experts with Mixture of Precisions for Tuning Quality of Service

Mixture of Tokens: Continuous MoE through Cross-Example Aggregation

Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Merging Experts into One: Improving Computational Efficiency of Mixture of Experts

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference