What problem does this paper attempt to address?

This paper aims to improve language model performance by refining the Multi-Head Mixture-of-Experts (MH-MoE) mechanism without increasing compute or parameter count. Specifically, it proposes a new MH-MoE implementation that matches the Sparse Mixture of Experts (SMoE) in both floating-point operations (FLOPs) and number of parameters, while performing better on multiple language modeling tasks.

### Main Contributions

1. **New MH-MoE Implementation**:
   - Proposes an implementation that improves performance while keeping FLOPs and the number of parameters comparable to the sparse MoE model.
   - Verifies the method experimentally on multiple language modeling tasks, including 1-bit quantized models.
2. **Performance Improvement**:
   - The new MH-MoE implementation outperforms both the traditional SMoE and fine-grained SMoE models in the standard setting, and it also performs well in the 1-bit quantization setting.
3. **Structure Analysis**:
   - Ablation experiments analyze the influence of the head layer and the merge layer on model performance. Both layers play a key role, and the head layer contributes more.

### Formulas and Details

- **Basic Formula of MH-MoE**:
  - The input \(x\in\mathbb{R}^d\) is linearly projected to \(\hat{x} = xW_{\text{head}}\), where \(W_{\text{head}}\in\mathbb{R}^{d\times d}\).
  - \(\hat{x}\) is split into \(h\) sub-vectors, and each sub-vector \(\tilde{x}\) is processed by different experts.
  - The output \(y\) for a sub-vector \(\tilde{x}\) is the gate-weighted sum of the activated experts' outputs:

    \[ y=\sum_{p\in\Phi}G(\tilde{x})\cdot\text{Expert}_p(\tilde{x}) \]

    where \(\Phi\) is the set of activated experts and \(G(\tilde{x})\) is the gating function.
- **Complexity Analysis**:
  - The FLOPs (scalar multiplications and additions) in MH-MoE are the sum of three terms:

    \[ \text{Head Layer}+\text{Activated Experts}+\text{Merge Layer} \]

    \[ 2Bd^2 - Bd+(4Bdd_{\text{moe}} - Bd - Bd_{\text{moe}}h)\cdot k+2Bd^2 - Bd \]
  - For sparse MoE with top-1 gating (i.e., \(k = 1\)) and intermediate dimension \(d_{\text{moe}} = 4d\), the count is \(16Bd^2 - 5Bd\).
  - In MH-MoE, \(d_{\text{moe}}\) is adjusted so that the leading term matches that of sparse MoE. For example, with \(h = 2\) and \(d_{\text{moe}} = 3d\), substituting into the expression above gives \(16Bd^2 - 9Bd\).

### Experimental Results

- **Language Modeling Evaluation**:
  - Models are pre-trained and validated on the RedPajama dataset, and MH-MoE outperforms the SMoE and fine-grained SMoE models in different settings.
  - The advantage of MH-MoE is particularly pronounced in the 1-bit quantization setting.
- **Ablation Experiment**:
  - An analysis of the head layer and the merge layer shows that both play an important role in improving model performance.
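
The complexity expressions under "Complexity Analysis" can be checked numerically. The following is a minimal Python sketch, not the authors' code. It assumes that \(B\) is the number of input tokens, that the head and merge layers are bias-free \(d\times d\) projections, that each expert is a bias-free two-layer feed-forward network, and that each of the \(h\) sub-vectors has width \(d/h\). All function names are hypothetical. Under these assumptions, a linear layer from `d_in` to `d_out` costs `d_out * (2 * d_in - 1)` multiplications and additions per input vector, which reproduces the per-layer terms in the expression.

```python
def linear_flops(n_vectors: int, d_in: int, d_out: int) -> int:
    """Multiplications plus additions for a bias-free linear layer (assumed cost model)."""
    return n_vectors * d_out * (2 * d_in - 1)


def smoe_flops(B: int, d: int, d_moe: int, k: int = 1) -> int:
    """FLOPs of k activated experts in a sparse MoE layer (hypothetical helper)."""
    # Assumption: each expert is a two-layer FFN, d -> d_moe -> d.
    per_expert = linear_flops(B, d, d_moe) + linear_flops(B, d_moe, d)
    return per_expert * k


def mh_moe_flops(B: int, d: int, d_moe: int, h: int, k: int = 1) -> int:
    """FLOPs of head layer + activated experts + merge layer (hypothetical helper)."""
    if d % h != 0:
        raise ValueError("d must be divisible by h")
    head = linear_flops(B, d, d)    # 2*B*d^2 - B*d
    merge = linear_flops(B, d, d)   # 2*B*d^2 - B*d
    # Assumption: B tokens become B*h sub-vectors, each of width d // h.
    sub = d // h
    per_expert = (
        linear_flops(B * h, sub, d_moe)    # 2*B*d*d_moe - B*d_moe*h
        + linear_flops(B * h, d_moe, sub)  # 2*B*d*d_moe - B*d
    )
    return head + per_expert * k + merge


if __name__ == "__main__":
    B, d = 8, 512  # illustrative sizes, not taken from the paper
    print("SMoE   (d_moe = 4d):       ", smoe_flops(B, d, 4 * d))
    print("MH-MoE (h = 2, d_moe = 3d):", mh_moe_flops(B, d, 3 * d, h=2))
    print("shared leading term 16Bd^2:", 16 * B * d * d)
```

With these assumptions, the sparse MoE count equals \(16Bd^2 - 5Bd\) and the MH-MoE count for \(h = 2\), \(d_{\text{moe}} = 3d\) equals \(16Bd^2 - 9Bd\): the leading terms agree and only the lower-order term differs.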

MH-MoE: Multi-Head Mixture-of-Experts

Multi-Head Mixture-of-Experts

HMoE: Heterogeneous Mixture of Experts for Language Modeling

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

Mixture of Attention Heads: Selecting Attention Heads Per Token

Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast

A Mixture of $h-1$ Heads is Better than $h$ Heads

HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts

Mixture of Diverse Size Experts

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

Merging Experts into One: Improving Computational Efficiency of Mixture of Experts

A Closer Look into Mixture-of-Experts in Large Language Models

MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts

BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

MoEC: Mixture of Expert Clusters

AC-MMOE: A Multi-gate Mixture-of-experts Model Based on Attention and Convolution