Abstract: Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. In particular, we show that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis. Finally, we extend this approach to multi-head attention projections, which results in additional savings compared to only converting the FFN blocks. The proposed method, Dense to Dynamic-$k$ Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, allowing us to save up to 60% of inference cost without significantly affecting model performance.
What problem does this paper attempt to address?
The paper addresses the high computational cost of Transformer models in practical applications. Although Transformers perform well on many tasks, their compute requirements limit deployment in resource-limited environments. To address this, the authors propose converting a dense model into a dynamic-k Mixture-of-Experts model (Dense to Dynamic-k Mixture-of-Experts, D2DMoE), which reduces inference cost without significantly affecting model performance.
### 1. **Problem Background**
Transformer models dominate many areas of deep learning, including machine translation, language modeling, and computer vision. Their effectiveness is closely tied to how well they scale with parameter count, which has pushed researchers to train ever larger models. However, the high computational requirements of these large models often limit their use in practical settings.
At the same time, Transformer models exhibit significant activation sparsity in their intermediate representations, which suggests that much of the computation may be redundant. Conditional computation methods can skip this unnecessary work. In particular, the Mixture-of-Experts (MoE) layer decouples the number of model parameters from the computational cost by executing only a sparse subset of experts.
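The link between activation sparsity and MoE conversion can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a bias-free ReLU FFN, random weights, toy sizes, and a contiguous split of hidden neurons into experts (real conversions typically group neurons more carefully). The helper names `relu_ffn`, `activation_sparsity`, and `split_into_experts` are hypothetical. It shows that summing all expert outputs reproduces the dense output, so any expert whose activations are all zero for a token can be skipped without changing the result.

```python
import numpy as np

def relu_ffn(x, w1, w2):
    """Dense FFN (no biases): returns the output and the hidden activations."""
    hidden = np.maximum(x @ w1, 0.0)
    return hidden @ w2, hidden

def activation_sparsity(hidden):
    """Fraction of hidden activations that are exactly zero."""
    return float(np.mean(hidden == 0.0))

def split_into_experts(x, w1, w2, num_experts):
    """Per-expert outputs when hidden neurons are split into contiguous groups."""
    groups = np.array_split(np.arange(w1.shape[1]), num_experts)
    return [np.maximum(x @ w1[:, g], 0.0) @ w2[g, :] for g in groups]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))    # 4 tokens, model width 8 (toy sizes)
w1 = rng.normal(size=(8, 32))  # hidden width 32
w2 = rng.normal(size=(32, 8))

dense_out, hidden = relu_ffn(x, w1, w2)
expert_outs = split_into_experts(x, w1, w2, num_experts=4)
moe_out = np.sum(expert_outs, axis=0)  # running every expert recovers the dense output

print("sparsity:", activation_sparsity(hidden))
print("equivalent:", np.allclose(dense_out, moe_out))
```

With random weights the sparsity is only moderate; the savings discussed in the paper come from trained models whose activations are much sparser, so that most experts contribute little for a given token.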
### 2. **Limitations of Existing Methods**
Although the existing MoEfication methods can convert a dense Transformer model into a more efficient MoE model, they have limitations in the following aspects:
- **Insufficient exploration of the impact of activation sparsity**: The impact of activation sparsity on conversion efficiency has not been fully studied.
- **Limitations of the router training scheme**: The router training scheme in the original MoEfication algorithm limits the effect of the conversion process.
- **Inefficiency of the static top-k selection rule**: The standard top-k expert selection rule cannot adapt to the significant differences in the number of activated neurons between different inputs.
- **Conversion limited to FFN layers**: Existing methods mainly focus on the conversion of feed-forward network (FFN) layers, ignoring other parts such as the multi-head attention (MHA) layer.
### 3. **Solutions Proposed in the Paper**
To solve the above problems, the authors propose the D2DMoE method, which mainly includes the following innovations:
1. **Enhancing activation sparsity**: A lightweight fine-tuning process increases the activation sparsity of the base model, which significantly improves the trade-off between cost and performance.
2. **Improving the router training scheme**: Router training is framed as a regression problem in which the router directly predicts the ℓ2 norm of each expert's output, giving a more accurate estimate of each expert's contribution.
3. **Introducing a dynamic-k selection rule**: The number of experts to execute is adjusted for each input token according to the experts' predicted contributions, so the model allocates computation more efficiently.
4. **Extension to multi-head attention layers**: The conversion method is generalized to any independent linear layer, including the gated MLP variants in modern LLMs and the projections of the MHA layer, further reducing the computational cost.
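The router regression target and the per-token selection described in items 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the summary does not give the exact selection rule or loss, so the sketch assumes a relative threshold (keep experts whose predicted norm is at least `tau` times the largest predicted norm for that token) and a mean-squared-error loss. All function names and the hyperparameter `tau` are hypothetical.

```python
import numpy as np

def expert_output_norms(expert_outs):
    """Regression targets: l2 norm of each expert's output, shape (tokens, experts)."""
    return np.stack([np.linalg.norm(o, axis=-1) for o in expert_outs], axis=-1)

def router_regression_loss(pred_norms, target_norms):
    """Assumed loss: mean squared error between predicted and true norms."""
    return float(np.mean((pred_norms - target_norms) ** 2))

def dynamic_k_mask(pred_norms, tau):
    """Assumed rule: keep experts with predicted norm >= tau * per-token maximum."""
    threshold = tau * pred_norms.max(axis=-1, keepdims=True)
    return pred_norms >= threshold

def top_k_mask(pred_norms, k):
    """Static baseline: always keep the k experts with the largest predicted norm."""
    idx = np.argsort(-pred_norms, axis=-1)[:, :k]
    mask = np.zeros(pred_norms.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

# Hand-made router predictions for 2 tokens and 4 experts.
pred = np.array([
    [4.0, 0.1, 0.2, 0.1],  # one dominant expert
    [1.0, 0.9, 0.8, 0.7],  # contributions spread across all experts
])

dyn_counts = dynamic_k_mask(pred, tau=0.5).sum(axis=-1)
topk_counts = top_k_mask(pred, k=2).sum(axis=-1)
print("dynamic-k experts per token:", dyn_counts.tolist())  # [1, 4]
print("top-2 experts per token:", topk_counts.tolist())     # [2, 2]
```

In this example the static rule spends the same budget on both tokens, while the threshold rule runs one expert for the first token and all four for the second, which matches the motivation given above for a per-token selection.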
### 4. **Experimental Results**
The authors ran experiments on benchmarks spanning text classification, image classification, and language modeling. The results show that D2DMoE outperforms existing methods across different computational budgets and can save up to 60% of the inference cost without significantly affecting model performance.
In conclusion, the paper targets the high computational requirements of Transformer models in practical applications. D2DMoE improves computational efficiency by combining activation sparsity enhancement, a regression-based router, dynamic-k expert selection, and conversion of the MHA projections.