MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts

Zhitian Xie, Yinger Zhang, Chenyi Zhuang, Qitao Shi, Zhining Liu, Jinjie Gu, Guannan Zhang
2024-01-31
Abstract: Mixture-of-experts (MoE) models are gaining popularity because of their ability to improve model performance. In an MoE structure, the gate layer plays a significant role in distinguishing input features and routing them to different experts, which lets each expert specialize in its corresponding sub-task. However, the gate's routing mechanism also gives rise to narrow vision: an individual expert cannot use more samples to learn its allocated sub-task, which in turn limits how far the MoE can improve its generalization ability. To address this, we propose Mixture-of-Distilled-Expert (MoDE), a method that applies moderate mutual distillation among experts so that each expert picks up more of the features learned by the other experts and gains a more accurate perception of its originally allocated sub-task. We conduct extensive experiments on tabular, NLP, and CV datasets, which show MoDE's effectiveness, universality, and robustness. Furthermore, we develop a parallel study built on a novel "expert probing" approach to show experimentally why MoDE works: moderate knowledge distillation improves each individual expert's test performance on its assigned tasks, which leads to the MoE's overall performance improvement.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to solve the "narrow vision" problem in Mixture-of-Experts (MoE) models. In the traditional MoE structure, the gating layer routes input features to different experts, and each expert focuses on the sub-tasks assigned to it. However, this mechanism gives each expert access to only a limited number of samples, which limits the generalization ability of the model. Specifically, the paper points out the following:

1. **Narrow vision problem**: Because each expert only sees the samples that the gating layer assigns to it, it cannot learn from more data, which limits the overall generalization ability of the model.
2. **Solution**: To solve this problem, the paper proposes a new method, Mixture-of-Distilled-Expert (MoDE). By introducing moderate mutual distillation among experts, each expert can learn more features from the other experts, which improves its understanding of, and performance on, its assigned sub-tasks.

### Main contributions

1. **Proposing the MoDE model**: By introducing a mutual distillation mechanism into the MoE structure, MoDE lets each expert keep its specialty while also learning useful features from the other experts, which improves the generalization ability of the overall model.
2. **Experimental verification**: The paper reports experiments on multiple datasets, covering tabular data, natural language processing (NLP), and computer vision (CV) tasks, to demonstrate the effectiveness, versatility, and robustness of MoDE.
3. **Mechanism analysis**: Using the "expert probing" method, the paper analyzes why MoDE is effective and how mutual distillation improves the test performance of each expert on its assigned task, which in turn improves the performance of the entire MoE model.
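The idea of adding mutual distillation to an MoE objective can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not give the exact loss, so the pairwise mean-squared-difference distillation term, the mean-squared-error task loss, the dense softmax gate, and the weight `alpha` are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_output(gate_scores, expert_outputs):
    """Gate-weighted sum of the expert output vectors (dense MoE, assumed)."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]

def mutual_distillation_loss(expert_outputs):
    """Mean squared difference between expert outputs, averaged over all
    unordered expert pairs. Hypothetical choice of distillation term."""
    k = len(expert_outputs)
    dim = len(expert_outputs[0])
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            diff = zip(expert_outputs[i], expert_outputs[j])
            total += sum((a - b) ** 2 for a, b in diff) / dim
            pairs += 1
    return total / pairs if pairs else 0.0

def mode_loss(gate_scores, expert_outputs, target, alpha=0.1):
    """Task loss (MSE of the mixture against the target) plus
    alpha times the mutual distillation term. alpha is an assumed name."""
    pred = moe_output(gate_scores, expert_outputs)
    task = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return task + alpha * mutual_distillation_loss(expert_outputs)
```

In this sketch a small `alpha` corresponds to the "moderate" distillation the summary describes: the experts are nudged toward one another's outputs, while the task term still dominates so that they can stay specialized.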
### Experimental results

- **Tabular datasets**: MoDE significantly improves the test accuracy on multiple classification tasks.
- **Natural language datasets**: MoDE achieves an improvement in BLEU scores on both low-resource and high-resource translation tasks.
- **Computer vision datasets**: MoDE also significantly improves the test accuracy on multiple classic CV tasks.

### Mechanism discussion

1. **Experts perform better in their own task domains**: Through the expert probing method, the paper finds that each expert in MoDE not only maintains its specialty but also significantly improves its accuracy in its own task domain.
2. **The gating layer knows the experts better**: The gating layer in MoDE is more inclined to select appropriate experts when assigning weights, thereby improving the recognition accuracy.
3. **The role of mutual distillation**: Moderate mutual distillation prompts each expert to pay attention to features it would otherwise neglect, so it performs better at test time.

In conclusion, this paper effectively solves the narrow-view problem in the MoE model by introducing the mutual distillation mechanism, improving the generalization ability and overall performance of the model.
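The per-expert evaluation described under the mechanism discussion can be illustrated with a short sketch. The summary does not define "expert probing" precisely, so this is one plausible reading rather than the paper's procedure: each test sample is attributed to the expert with the largest gate weight, and each expert is scored alone on the samples attributed to it. The function name `probe_experts` and both input layouts are hypothetical.

```python
def probe_experts(gate_weights, expert_correct):
    """Per-expert accuracy on the samples each expert dominates.

    gate_weights[n][k]   : gate weight of expert k on sample n
    expert_correct[n][k] : True if expert k alone classifies sample n correctly
    Returns a list with one accuracy per expert, or None for an expert
    that dominates no sample.
    """
    k = len(gate_weights[0])
    hits = [0] * k
    counts = [0] * k
    for weights, correct in zip(gate_weights, expert_correct):
        # Assumed attribution rule: the sample belongs to the expert
        # with the largest gate weight.
        dominant = max(range(k), key=lambda e: weights[e])
        counts[dominant] += 1
        hits[dominant] += int(correct[dominant])
    return [hits[e] / counts[e] if counts[e] else None for e in range(k)]


# Toy data: expert 0 dominates the first two samples, expert 1 the third.
gate_weights = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
expert_correct = [[True, False], [False, True], [False, True]]
print(probe_experts(gate_weights, expert_correct))  # [0.5, 1.0]
```

Comparing these per-expert accuracies between a plain MoE and a MoDE-style model is the kind of comparison the first point of the mechanism discussion refers to.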