Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

Haoran Xu, Maha Elbayad, Kenton Murray, Jean Maillard, Vedanuj Goswami
2023-10-23
Abstract: Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient as the improvement in performance diminishes with an increasing number of experts. We hypothesize this parameter inefficiency is a result of all experts having equal capacity, which may not adequately meet the varying complexity requirements of different tokens or tasks. In light of this, we propose Stratified Mixture of Experts (SMoE) models, which feature a stratified structure and can assign dynamic capacity to different tokens. We demonstrate the effectiveness of SMoE on three multilingual machine translation benchmarks, containing 4, 15, and 94 language pairs, respectively. We show that SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the parameter inefficiency of Mixture-of-Experts (MoE) models. Existing MoE models use sparse activation to increase the parameter count substantially while keeping per-token computation low, but their performance gains diminish as the number of experts grows. The authors attribute this to every expert having the same capacity, which may not match the differing complexity of tokens or tasks. To address it, they propose the Stratified Mixture of Experts (SMoE) model, whose stratified structure assigns dynamic capacity to different tokens. This lets SMoE use its parameters more efficiently and perform better on tasks such as multilingual machine translation.

### Main Contributions

1. **Introduction of Dynamic Capacity Concept**: The paper proposes a stratified mixture of experts model (SMoE) that automatically assigns dynamic capacity to different input tokens, making the model more parameter-efficient.
2. **Superior Performance in Multilingual Machine Translation Tasks**: On multilingual machine translation, SMoE significantly outperforms several strong baselines with the same or fewer parameters. For example, SMoE can match the performance of traditional MoE models with only half the parameters.
3. **In-depth Analysis of Factors Affecting Dynamic Capacity Allocation**: The paper examines what drives the allocation of capacity, including the language of the tokens and the position of the SMoE blocks in the model architecture.

### Experimental Results

- **M4 Dataset**: On the M4 dataset containing 4 languages, the SMoE-2-2-2-2 configuration achieves an average BLEU score 0.9 higher than the Switch Transformer and 0.74 higher than traditional MoE models.
- **M15 Dataset**: On the M15 dataset containing 15 languages, the best SMoE configuration (SMoE-4-12) achieves an average BLEU score 1.04 higher than the Switch Transformer and 0.93 higher than traditional MoE models.
- **OPUS-100 Dataset**: On the OPUS-100 dataset containing 94 languages, the SMoE-4-12 configuration achieves an average BLEU score 1.01 higher than traditional MoE models and 1.63 higher than the Switch Transformer.

### Analysis

- **Impact of Token Language**: In the decoder, the target language significantly affects the Request Capacity (RC) of tokens, while the source language has a smaller impact. In the encoder, RC is influenced by both the source and target languages.
- **Impact of Token Frequency**: High-frequency tokens generally have lower RC, because the model sees them so often during training that it needs little capacity to process them. Conversely, low-frequency tokens may require higher RC because they are rarer or more complex.

In summary, this paper improves the parameter efficiency of Mixture-of-Experts models by introducing a stratified structure and a dynamic capacity allocation mechanism, and it reports consistent BLEU gains over Switch Transformer and traditional MoE baselines on multilingual machine translation.
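To make the idea of dynamic capacity more concrete, here is a minimal Python sketch of how a stratified layout could give different tokens different amounts of computation. The summary does not spell out the routing rule, so everything here is an assumption for illustration: the sketch supposes that a router picks one expert from any stratum the token has not yet passed, and that the token then continues from the stratum after the chosen one. The names `route_token`, `mean_experts_visited`, and `choose` are hypothetical, the uniform random choice is only a stand-in for a learned router, and reading "SMoE-4-12" as two strata with 4 and 12 experts is likewise an assumption.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (stratum index, expert index within that stratum)
Expert = Tuple[int, int]


def route_token(strata_sizes: Sequence[int],
                choose: Callable[[List[Expert]], Expert]) -> List[Expert]:
    """Return the experts one token visits in a stratified layout.

    Assumed rule (not taken from the summary): at each step the router may
    pick any expert in the current stratum or a later one; the token then
    resumes from the stratum after the one it was sent to.
    """
    if any(size < 1 for size in strata_sizes):
        raise ValueError("every stratum needs at least one expert")

    path: List[Expert] = []
    start = 0
    while start < len(strata_sizes):
        # All experts in strata the token has not passed yet.
        candidates = [(s, e)
                      for s in range(start, len(strata_sizes))
                      for e in range(strata_sizes[s])]
        stratum, expert = choose(candidates)  # stand-in for a learned gate
        path.append((stratum, expert))
        start = stratum + 1  # earlier strata are no longer reachable
    return path


def mean_experts_visited(paths: Sequence[Sequence[Expert]]) -> float:
    """Average path length: a rough proxy for the capacity tokens receive."""
    return sum(len(p) for p in paths) / len(paths)


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = [4, 12]  # assumed reading of an "SMoE-4-12" style layout
    paths = [route_token(sizes, rng.choice) for _ in range(1000)]
    print("mean experts visited per token:", mean_experts_visited(paths))
```

Under this assumed rule, a token sent to an early stratum passes through more experts than one sent straight to the last stratum, so the number of experts on its path varies from 1 up to the number of strata. The average path length is only a loose analogue of the RC statistic discussed in the Analysis section, not the paper's definition.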