Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation

Fahao Chen, Peng Li, Zicong Hong, Zhou Su, Song Guo
2024-11-23
Abstract: Mixture-of-Experts (MoE) is an emerging technique for scaling large models with sparse activation. MoE models are typically trained in a distributed manner with an expert parallelism scheme, where the experts in each MoE layer are distributed across multiple GPUs. However, default expert parallelism suffers from a heavy network burden due to the all-to-all intermediate data exchange among GPUs before and after the expert run. Some existing works reduce these intermediate data exchanges by transferring experts between GPUs, but doing so lowers the parallelism of expert execution and makes computation inefficient. These weaknesses motivate us to explore whether it is possible to reduce inter-GPU traffic while maintaining a high degree of expert parallelism. This paper answers in the affirmative by presenting Luffy, a communication-efficient distributed MoE training system with two new techniques. First, Luffy migrates sequences among GPUs to hide heavy token pulling paths within GPUs and avoid copying experts across GPUs. Second, we propose token condensation, which identifies similar tokens and eliminates redundant transmissions. We implement Luffy based on PyTorch and evaluate its performance on a testbed of 16 V100 GPUs. Luffy achieves a speedup of up to 2.73x compared to state-of-the-art MoE training systems.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses how to reduce inter-GPU communication while maintaining a high degree of parallelism when training large-scale sparsely-activated models, such as Mixture-of-Experts (MoE) models, in a distributed manner. Existing expert parallelism schemes can maximize GPU utilization, but they require a large volume of data exchange between GPUs, which places a heavy load on the network and seriously degrades training efficiency. Some existing works reduce the intermediate data exchange by migrating experts, but this lowers the parallelism of expert execution and hurts computational efficiency. The paper therefore proposes a method that reduces cross-GPU communication without sacrificing parallelism, thereby speeding up distributed MoE training. To this end, it presents the LUFFY system, which combines two key techniques:

1. **Sequence Migration**: During the combine phase, LUFFY hides heavy token-fetching paths within GPUs by migrating each sequence to the GPU that processes most of its tokens. This avoids replicating experts across GPUs and reduces cross-GPU communication. Sequence migration also creates an opportunity to optimize attention computation: clustering sequences of similar lengths reduces the amount of zero-padding and improves batching efficiency.
2. **Token Condensation**: During the dispatch phase, LUFFY further reduces cross-GPU communication by identifying similar tokens and eliminating their redundant transmissions. Because token similarity is preserved after expert processing, token condensation reduces communication in the combine phase as well as in the dispatch phase.
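The sequence migration idea described above can be illustrated with a small sketch. This is not the paper's algorithm: the function names `choose_destinations` and `combine_traffic`, the toy routing table, and the tie-breaking rule are all assumptions made for illustration. The sketch also ignores the cost of moving the sequence itself and any load-balancing constraint a real system would need.

```python
from collections import Counter

def choose_destinations(token_gpus):
    """Pick, for each sequence, the GPU that processes most of its tokens.

    token_gpus[s] lists, for every token of sequence s, the id of the GPU
    hosting the expert that token was routed to (hypothetical input format).
    """
    destinations = []
    for gpus in token_gpus:
        counts = Counter(gpus)
        # Ties break toward the lowest GPU id so the result is deterministic.
        best = min(counts, key=lambda g: (-counts[g], g))
        destinations.append(best)
    return destinations

def combine_traffic(token_gpus, targets):
    """Count tokens that must cross GPUs to reach their sequence's target GPU."""
    return sum(
        sum(1 for g in gpus if g != target)
        for gpus, target in zip(token_gpus, targets)
    )

# Toy example: three sequences whose home GPUs are 0, 1, and 2.
home_gpus = [0, 1, 2]
token_gpus = [
    [1, 1, 1, 0],      # sequence 0: most tokens are processed on GPU 1
    [1, 2, 2, 2, 2],   # sequence 1: most tokens are processed on GPU 2
    [2, 0, 0],         # sequence 2: most tokens are processed on GPU 0
]

destinations = choose_destinations(token_gpus)
baseline = combine_traffic(token_gpus, home_gpus)     # tokens return home
migrated = combine_traffic(token_gpus, destinations)  # tokens go to new GPU
print(destinations, baseline, migrated)  # prints: [1, 2, 0] 9 3
```

In this toy routing, returning every token to its home GPU moves 9 tokens across GPUs, while gathering each sequence on its majority GPU moves only 3. The experts themselves never move, so expert parallelism is unchanged.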
Token condensation identifies similar tokens quickly with a fast heuristic algorithm, and it bounds the approximation error introduced in the expert computation so that training still converges. Together, these techniques let LUFFY significantly reduce cross-GPU communication while maintaining a high degree of parallelism, which greatly increases the training speed of distributed MoE models. Experimental results show that on a testbed of 16 V100 GPUs, LUFFY achieves a speedup of up to 2.73x over state-of-the-art MoE training systems.
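Token condensation can be sketched in the same spirit. The summary does not specify the heuristic or the error bound, so everything here is an assumption: a greedy single pass, cosine similarity as the similarity measure, a fixed `threshold` standing in for the error constraint, and a toy `expert` function. Only representative tokens are "sent" to the expert, and every condensed token reuses its representative's output.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def condense(tokens, threshold=0.99):
    """Greedy pass (assumed heuristic): reuse the first similar representative."""
    reps = []    # indices of tokens that are actually transmitted
    assign = []  # assign[i] = position in reps whose output token i reuses
    for i, tok in enumerate(tokens):
        for r, rep_idx in enumerate(reps):
            if cosine(tok, tokens[rep_idx]) >= threshold:
                assign.append(r)
                break
        else:
            # No representative is similar enough: this token is sent itself.
            reps.append(i)
            assign.append(len(reps) - 1)
    return reps, assign

def expand(rep_outputs, assign):
    """Rebuild one output per original token from the representatives' outputs."""
    return [rep_outputs[r] for r in assign]

def expert(token):
    """Toy stand-in for an expert network (hypothetical)."""
    return [2.0 * x for x in token]

tokens = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.01],  # nearly identical to token 0
    [0.01, 1.0],  # nearly identical to token 1
    [1.0, 1.0],
]

reps, assign = condense(tokens)
rep_outputs = [expert(tokens[i]) for i in reps]  # only 3 of 5 tokens are sent
outputs = expand(rep_outputs, assign)
print(reps, assign)  # prints: [0, 1, 4] [0, 1, 0, 1, 2]
```

Because condensed tokens reuse their representative's expert output, the same mapping shrinks the traffic on the way back in the combine phase. This greedy scan compares each token against every representative found so far, so it is quadratic in the worst case; a production system would need a faster similarity search.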