Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

Juncai Liu, Jessie Hui Wang, Yimin Jiang
DOI: https://doi.org/10.1145/3603269.3604869
2023-01-01
Abstract: Scaling models to large sizes to improve performance has become a trend in deep learning, and the sparsely activated Mixture-of-Experts (MoE) is a promising architecture for scaling models. However, training MoE models in existing systems is expensive, mainly due to the All-to-All communication between layers. All-to-All communication originates from the expert-centric paradigm: keeping experts in place and exchanging intermediate data to feed experts. We propose a novel data-centric paradigm: keeping data in place and moving experts between GPUs. Since the experts can be smaller than the data, the data-centric paradigm can reduce the communication workload. Based on this insight, we develop Janus. First, Janus supports fine-grained asynchronous communication, which can overlap computation and communication. Janus also implements hierarchical communication to further reduce cross-node traffic by sharing the fetched experts within the same machine. Second, when scheduling the "fetching expert" requests, Janus implements a topology-aware priority strategy to utilize intra-node and inter-node links efficiently. Finally, Janus allows experts to be prefetched, which allows the downstream computation to start immediately once the previous step completes. Evaluated on a 32-A100 cluster, Janus reduces traffic by up to 16× and achieves up to 2.06× speedup compared with current MoE training systems.
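The trade-off between the two paradigms can be illustrated with a back-of-the-envelope traffic estimate. The sketch below is not taken from the paper: the function names, the uniform-routing assumption, the two-matrix feed-forward expert, and the forward-pass-only accounting are all simplifying assumptions chosen for illustration. It compares the bytes one worker exchanges per MoE layer when tokens are moved to experts (expert-centric) with the bytes exchanged when experts are moved to tokens (data-centric).

```python
def expert_centric_bytes(tokens_per_worker, hidden_dim, top_k,
                         num_workers, bytes_per_elem=2):
    """Approximate bytes one worker sends and receives in one MoE layer
    (forward pass) when tokens are shipped to remote experts.

    Assumption: routing is uniform, so a fraction (W - 1) / W of the
    token copies goes to experts on other workers. Each token copy is
    dispatched once and its result is returned once (two All-to-Alls).
    """
    remote_fraction = (num_workers - 1) / num_workers
    one_way = (tokens_per_worker * top_k * hidden_dim
               * bytes_per_elem * remote_fraction)
    return 2 * one_way


def data_centric_bytes(hidden_dim, ffn_dim, num_workers,
                       experts_per_worker=1, bytes_per_elem=2):
    """Approximate bytes one worker receives in one MoE layer (forward
    pass) when it fetches every remote expert instead.

    Assumption: each expert is a two-matrix feed-forward block
    (hidden_dim x ffn_dim and ffn_dim x hidden_dim), biases ignored.
    """
    expert_params = 2 * hidden_dim * ffn_dim
    remote_experts = (num_workers - 1) * experts_per_worker
    return remote_experts * expert_params * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration, not one reported in the paper.
    ec = expert_centric_bytes(tokens_per_worker=65536, hidden_dim=768,
                              top_k=2, num_workers=8)
    dc = data_centric_bytes(hidden_dim=768, ffn_dim=3072, num_workers=8)
    print(f"expert-centric: {ec / 2**20:.1f} MiB")
    print(f"data-centric:   {dc / 2**20:.1f} MiB")
    print(f"ratio:          {ec / dc:.2f}x")
```

Under this simplified model, moving experts wins when the number of tokens per worker is large relative to the expert size, and loses when batches are small. The ratio it produces depends entirely on the assumed configuration and should not be read as a reproduction of the traffic reduction reported above.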