FASTERMOE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, Qin Li
DOI: https://doi.org/10.1145/3503221.3508418
2022-01-01
Abstract: The current trend in deep learning is to scale models to extremely large sizes in order to increase their accuracy. Mixture-of-Experts (MoE) is the most popular pre-trained model that makes training models with more than a trillion parameters feasible. Thanks to the dynamic activation of experts, i.e., shallow layers specialized in certain domains, it allows sparse training of bigger models, removing the linear relationship between model size and computation. However, unlike traditional deep learning models, it poses huge challenges to the efficiency of training systems, including dynamic load imbalance, an inefficient synchronous execution mode, and congested all-to-all communication. To address these challenges, we first propose a performance model that can both accurately predict the latency of different operations of a specific training task and intuitively analyze its end-to-end performance via a novel roofline-like model. Then, guided by this model, we invent a dynamic shadowing approach to cope with load imbalance, and a smart fine-grained schedule that splits different operations and executes them concurrently. We design a congestion-avoiding expert selection strategy that relieves network congestion to lower iteration latency when modification of expert selection is allowed. We implement and integrate the above optimizations as a general system, FASTERMOE, empowering efficient distributed MoE model training. FASTERMOE is evaluated on different cluster systems using up to 64 GPUs. It achieves 1.37x - 17.87x speedup compared with state-of-the-art systems for large models, including ZeRO, GShard, and BASE Layer. Source code of FASTERMOE is now available at https://github.com/thu-pacman/FasterMoE.
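
To make the load-imbalance challenge named in the abstract concrete, the following minimal sketch routes a handful of tokens to experts with top-1 gating and measures how unevenly the work is spread. It is not taken from the FASTERMOE code base: the function names (`route_top1`, `expert_load`, `imbalance_ratio`), the gate scores, and the choice of max-over-mean load as the imbalance metric are all assumptions made for illustration.

```python
from collections import Counter


def route_top1(gate_scores):
    """Return, for each token, the index of its highest-scoring expert."""
    return [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in gate_scores
    ]


def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = Counter(assignments)
    return [counts.get(expert, 0) for expert in range(num_experts)]


def imbalance_ratio(load):
    """Ratio of the busiest expert's load to the mean load (1.0 = balanced)."""
    mean_load = sum(load) / len(load)
    return max(load) / mean_load


# Hypothetical gate scores: 8 tokens (rows) scored against 4 experts (columns).
gate_scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.6, 0.1, 0.2, 0.1],
]

assignments = route_top1(gate_scores)
load = expert_load(assignments, num_experts=4)
ratio = imbalance_ratio(load)

print("tokens per expert:", load)
print("imbalance ratio:", ratio)
```

In this toy input, expert 0 receives six of the eight tokens while expert 3 receives none. If each expert lives on a different worker and the workers proceed in lockstep, the step time would be set by the busiest expert, which is the kind of skew a load-balancing technique such as the paper's dynamic shadowing is meant to address.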