MH-MoE: Multi-Head Mixture-of-Experts

Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei
2024-11-26
Abstract: Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.
Computation and Language
What problem does this paper attempt to address?
This paper aims to improve language-model quality by refining the Multi-Head Mixture-of-Experts (MH-MoE) mechanism without increasing the compute cost or the parameter count. Specifically, it proposes a new MH-MoE implementation that matches the floating-point operations (FLOPs) and the number of parameters of a sparse Mixture of Experts (SMoE) model while performing better on multiple language modeling tasks.

### Main Contributions

1. **New MH-MoE Implementation**:
   - The paper proposes an implementation that improves quality while keeping FLOPs and parameter count on par with the sparse MoE model.
   - Experiments on multiple language modeling tasks verify the method, including its behavior in 1-bit quantized models.
2. **Performance Improvement**:
   - The new MH-MoE implementation outperforms both the traditional SMoE and fine-grained SMoE models in the standard setting, and it also performs well in the 1-bit quantization setting.
3. **Structure Analysis**:
   - Ablation experiments analyze the influence of the head layer and the merge layer. Both layers contribute to model quality, with the head layer contributing more.

### Formulas and Details

- **Basic Formula of MH-MoE**:
  - The input \(x\in\mathbb{R}^d\) is linearly projected to \(\hat{x} = xW_{\text{head}}\), where \(W_{\text{head}}\in\mathbb{R}^{d\times d}\).
  - \(\hat{x}\) is split into \(h\) sub-vectors, each of dimension \(d/h\), and each sub-vector can be processed by a different expert.
  - For each sub-vector \(\tilde{x}\), the output \(y\) is the gate-weighted sum of the activated experts' outputs:
    \[ y=\sum_{p\in\Phi}G(\tilde{x})\cdot\text{Expert}_p(\tilde{x}) \]
    where \(\Phi\) is the set of activated experts and \(G(\tilde{x})\) is the gating function that weights expert \(p\). The sub-vector outputs are then concatenated and passed through the merge layer.
- **Complexity Analysis**:
  - The number of scalar operations (multiplications and additions) in MH-MoE for \(B\) input vectors is the sum of three parts:
    \[ \text{Head Layer}+\text{Activated Experts}+\text{Merge Layer} \]
    \[ 2Bd^2 - Bd+(4Bdd_{\text{moe}} - Bd - Bd_{\text{moe}}h)\cdot k+2Bd^2 - Bd \]
    where \(d_{\text{moe}}\) is the intermediate dimension of each expert and \(k\) is the number of activated experts.
  - For sparse MoE, which has no head or merge layer and uses \(h = 1\), top-1 gating (\(k = 1\)) with intermediate dimension \(d_{\text{moe}} = 4d\) gives \(16Bd^2 - 5Bd\) operations.
  - In MH-MoE, the intermediate dimension is reduced so that the leading term of the operation count matches that of sparse MoE. For example, with \(h = 2\) and \(d_{\text{moe}} = 3d\), the expression above evaluates to \(16Bd^2 - 9Bd\).

### Experimental Results

- **Language Modeling Evaluation**:
  - Pre-training and validation are carried out on the RedPajama dataset. MH-MoE outperforms the SMoE and fine-grained SMoE models across the settings tested.
  - The advantage of MH-MoE is particularly pronounced in the 1-bit quantization setting.
- **Ablation Experiment**:
  - Ablating the head layer and the merge layer shows that both play an important role in improving model quality.
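
The operation counts in the complexity analysis can be checked by summing the cost of each matrix product. The following is a minimal Python sketch, not the authors' code. The helper names (`matmul_ops`, `expert_ops`, `smoe_ops`, `mh_moe_ops`) are hypothetical. It assumes that each expert is a plain two-layer feed-forward network, that the router cost is not counted, that \(d\) is divisible by \(h\), and that an \((n\times p)\cdot(p\times q)\) product costs \(n\cdot q\cdot(2p-1)\) multiplications and additions. Under these assumptions the sketch reproduces the head, expert, and merge terms of the expression in the complexity analysis.

```python
def matmul_ops(rows: int, d_in: int, d_out: int) -> int:
    """Multiplications plus additions for a (rows x d_in) @ (d_in x d_out) product."""
    return rows * d_out * (2 * d_in - 1)


def expert_ops(B: int, d: int, d_moe: int, h: int) -> int:
    """Cost of one activated expert per sub-vector (assumed two-layer FFN)."""
    if d % h != 0:
        raise ValueError("d must be divisible by h")
    sub_dim = d // h              # each of the h sub-vectors has dimension d/h
    rows = B * h                  # every input vector yields h sub-vectors
    up = matmul_ops(rows, sub_dim, d_moe)    # sub_dim -> d_moe
    down = matmul_ops(rows, d_moe, sub_dim)  # d_moe -> sub_dim
    return up + down


def smoe_ops(B: int, d: int, d_ffn: int, k: int = 1) -> int:
    """Sparse MoE: no head or merge layer, whole vector goes to each expert (h = 1)."""
    return k * expert_ops(B, d, d_ffn, h=1)


def mh_moe_ops(B: int, d: int, d_moe: int, h: int, k: int = 1) -> int:
    """MH-MoE: head layer + k activated experts per sub-vector + merge layer."""
    head = matmul_ops(B, d, d)
    merge = matmul_ops(B, d, d)
    return head + k * expert_ops(B, d, d_moe, h) + merge


if __name__ == "__main__":
    B, d = 8, 64  # illustrative sizes only
    print("sparse MoE (d_ffn = 4d):      ", smoe_ops(B, d, 4 * d))
    print("MH-MoE (h = 2, d_moe = 3d):   ", mh_moe_ops(B, d, 3 * d, h=2))
    print("shared leading term 16*B*d^2: ", 16 * B * d * d)
```

With \(h = 2\) and \(d_{\text{moe}} = 3d\), both counts share the leading term \(16Bd^2\) and differ only in the linear term (\(-5Bd\) for sparse MoE, \(-9Bd\) for MH-MoE), which is negligible for realistic values of \(d\).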