Abstract: Mixture-of-Experts (MoE) is an emerging technique for scaling large models with sparse activation. MoE models are typically trained in a distributed manner with an expert parallelism scheme, where the experts in each MoE layer are distributed across multiple GPUs. However, default expert parallelism places a heavy burden on the network because of the all-to-all exchange of intermediate data among GPUs before and after the experts run. Some existing works reduce this exchange by transferring experts between GPUs, but doing so lowers the parallelism of expert execution and makes computation inefficient. These weaknesses motivate us to explore whether inter-GPU traffic can be reduced while maintaining a high degree of expert parallelism. This paper answers in the affirmative by presenting Luffy, a communication-efficient distributed MoE training system with two new techniques. First, Luffy migrates sequences among GPUs to hide heavy token pulling paths within GPUs and avoid copying experts across GPUs. Second, we propose token condensation, which identifies similar tokens and eliminates redundant transmissions. We implement Luffy on PyTorch and evaluate it on a testbed of 16 V100 GPUs. Luffy achieves a speedup of up to 2.73x over state-of-the-art MoE training systems.

What problem does this paper attempt to address?

This paper addresses how to reduce inter-GPU communication while maintaining a high degree of parallelism when training large-scale sparsely-activated models, such as Mixture-of-Experts (MoE), in a distributed manner. Existing expert parallelism schemes make full use of GPU resources, but they require a large amount of data exchange between GPUs, which overloads the network and slows training. Some existing works reduce the exchange of intermediate data by migrating experts, but this lowers parallelism and hurts computational efficiency. The paper therefore aims to cut cross-GPU communication without sacrificing parallelism, and so to speed up distributed MoE training.

To achieve this goal, the paper proposes the Luffy system, which contains two key techniques:

1. **Sequence Migration**: During the combine phase, Luffy hides heavy token-fetching paths by migrating each sequence to the GPU that processes most of its tokens. This avoids replicating experts between GPUs and reduces cross-GPU communication. Sequence migration also creates an opportunity to optimize attention computation: clustering sequences of similar lengths reduces zero-padding and improves batch-processing efficiency.
2. **Token Condensation**: During the dispatch phase, Luffy further reduces cross-GPU communication by identifying similar tokens and eliminating their redundant transmissions. Because token similarity is retained after expert processing, token condensation also reduces communication in the combine phase. Similar tokens are identified quickly with a fast heuristic algorithm, and the approximation error introduced in expert computation is bounded to ensure that training converges.

With these techniques, Luffy significantly reduces cross-GPU communication while maintaining a high degree of parallelism, which speeds up distributed MoE training. Experimental results show that on a testbed with 16 V100 GPUs, Luffy achieves a speedup of up to 2.73x over state-of-the-art MoE training systems.
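
The two techniques described above can be illustrated with a small single-process sketch. It is not Luffy's implementation: the summary does not specify the heuristic or the error bound, so the sketch assumes a greedy cosine-similarity threshold for condensation and a simple majority vote for the migration target. All function names and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def pick_migration_target(token_expert_ids, expert_to_gpu):
    """Return the GPU hosting the experts that process most tokens of one sequence.

    Assumption: majority vote over per-token expert assignments; ties go to
    the GPU seen first.
    """
    counts = Counter(expert_to_gpu[e] for e in token_expert_ids)
    return counts.most_common(1)[0][0]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def condense_tokens(tokens, threshold=0.99):
    """Greedy condensation (an assumed stand-in for the paper's heuristic).

    Each token is mapped to the first earlier representative whose cosine
    similarity is at least `threshold`; otherwise it becomes a new
    representative. Only representatives would be transmitted.
    """
    representatives = []  # indices of tokens that are actually sent
    assignment = []       # per token: position of its representative
    for i, tok in enumerate(tokens):
        for pos, rep_idx in enumerate(representatives):
            if cosine(tok, tokens[rep_idx]) >= threshold:
                assignment.append(pos)
                break
        else:
            representatives.append(i)
            assignment.append(len(representatives) - 1)
    return representatives, assignment


def expand_outputs(rep_outputs, assignment):
    """Reuse each representative's expert output for the tokens it stands for."""
    return [rep_outputs[a] for a in assignment]


# Hypothetical layout: experts 0 and 1 live on GPU 0, experts 2 and 3 on GPU 1.
target = pick_migration_target([2, 3, 0, 2], {0: 0, 1: 0, 2: 1, 3: 1})

# The first two tokens are nearly parallel, so only one of them is sent.
reps, assign = condense_tokens([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
```

In this toy run the sequence would move to GPU 1, and two of the three tokens are transmitted. A lower `threshold` sends fewer tokens at the cost of a larger approximation error, which is the trade-off the summary says Luffy constrains to preserve convergence.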

Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation
