FasterMoE

Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, Qin Li
DOI: https://doi.org/10.1145/3503221.3508418
2022-01-01
Abstract: The current trend in deep learning is to scale models to extremely large sizes with the objective of increasing their accuracy. Mixture-of-Expert (MoE) is the most popular pre-trained model that makes feasible the training of models with parameters beyond trillion-scale. Thanks to the dynamic activation of experts, i.e., shallow layers specialized in certain domains, it allows for sparse training of bigger models, removing the linearity between model size and computation. However, different from traditional deep learning models, it draws huge challenges to the efficiency of these training systems, including dynamic load imbalance, inefficient synchronous execution mode, and congested all-to-all communication.