Abstract: The Mixture of Experts (MoE) is an advanced model architecture used in industry that combines multiple specialized expert models from various domains into a single supermodel. This approach lets the model scale without significantly increasing the computational cost of training and inference, while maximizing model performance. However, current distributed training frameworks do not fully optimize communication, especially for large base models. This paper proposes a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume and on the training cluster's inter-node and intra-node network topologies. Compared to DeepSpeed, MoNTA achieves an 8x increase in AllToAll communication performance under 8-card tensor parallelism. Compared to the baseline, training a 2x70B model on 16 A800 cards with an 8K sequence length improves overall latency by 13%. Project Page: https://github.com/EnflameTechnology/DeepSpeed

What problem does this paper attempt to address?

The paper attempts to address the issue of communication efficiency in large-scale distributed training of Mixture of Experts (MoE) models. Specifically, current distributed training frameworks do not sufficiently optimize communication, especially when dealing with large foundational models, which affects overall computational efficiency. The paper proposes a network-traffic-aware parallel optimization method (MoNTA) aimed at improving communication efficiency by selecting the optimal parallel strategy, thereby accelerating the training of MoE models.

### Main Issues:

1. **Low Communication Efficiency**: Existing distributed training frameworks do not sufficiently optimize communication when handling MoE models. Particularly in large-scale models and multi-node environments, communication overhead becomes a bottleneck.
2. **Insufficient Resource Utilization**: Current methods do not fully utilize high-bandwidth communication resources between and within nodes, leading to low utilization of computational resources.
3. **Lack of Comprehensive Optimization**: Existing optimization methods do not comprehensively consider communication volume, communication efficiency, and network topology, resulting in limited optimization effects.

### Solutions:

1. **Network-Traffic-Aware Parallel Optimization Method (MoNTA)**: This method selects the optimal parallel strategy based on communication volume and the network topology of the training cluster to improve communication efficiency.
2. **Data Redundancy Utilization**: By utilizing data redundancy in AllToAll communication under tensor parallelism, AllToAll communication is transformed into a combination of inter-node AllToAll and intra-node communication, thereby improving communication efficiency.
3. **Communication Data Slicing**: Based on the relationship between communication volume and communication efficiency, communication data is divided into different slices to ensure communication efficiency and achieve greater communication overlap.
4. **Utilization of High-Bandwidth Intra-Node Connections**: By fully utilizing high-bandwidth intra-node connections, communication efficiency is improved, thereby enhancing chip computational utilization.

### Experimental Results:

- Compared to the DeepSpeed baseline, MoNTA improves AllToAll communication performance by approximately 8 times under 8-card tensor parallelism.
- When training a 2x70B model using 16 A800 cards with a sequence length of 8K, overall latency performance is improved by 13%.

### Contributions:

1. Proposes a communication-aware parallel optimization method, MoNTA, which uses inter-node and intra-node communication resources to pipeline inter-node AllToAll and intra-node communication, and establishes a performance model of communication volume, communication efficiency, and parallel strategy to achieve MoE AllToAll communication overlap, improving computational utilization.
2. Introduces pipelining of intra-node communication and D2D replication, further reducing AllToAll overhead.
3. Analyzes communication conflict issues during the MoE model training process and provides a communication priority scheme.
4. Proposes an extension method for distributed parallel training clusters of long-context MoE models, generating distributed parallel extension strategies based on cluster resource parameters, model parameters, and context length.

Through these methods, MoNTA effectively addresses the communication efficiency issue in large-scale distributed training of MoE models, significantly enhancing training performance.
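The data-redundancy and slicing ideas above can be illustrated with a toy cost model. The sketch below is not the paper's performance model: the `Cluster` class, its bandwidth and overhead fields, and all function names are hypothetical. It assumes that every tensor-parallel rank in a node holds the same payload, that a naive AllToAll sends all copies over one shared inter-node link, and that the two-stage scheme sends one copy across nodes and then redistributes it inside the node with an AllGather-like step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical cluster description (not taken from the paper)."""
    inter_node_gb_per_s: float           # bandwidth of the link leaving a node
    intra_node_gb_per_s: float           # bandwidth between devices inside a node
    per_message_overhead_s: float = 0.0  # assumed fixed cost per transfer


def naive_alltoall_s(payload_gb: float, tp_degree: int, cluster: Cluster) -> float:
    """All tp_degree ranks push the same payload over the inter-node link."""
    return (cluster.per_message_overhead_s
            + tp_degree * payload_gb / cluster.inter_node_gb_per_s)


def stage_times_s(payload_gb: float, tp_degree: int, cluster: Cluster):
    """Full-payload time of each stage in the assumed two-stage scheme."""
    # Stage 1: only one deduplicated copy crosses the inter-node link.
    inter = payload_gb / cluster.inter_node_gb_per_s
    # Stage 2: each rank fetches the (tp_degree - 1) / tp_degree it is missing.
    intra = (tp_degree - 1) / tp_degree * payload_gb / cluster.intra_node_gb_per_s
    return inter, intra


def pipelined_s(payload_gb: float, tp_degree: int, cluster: Cluster,
                num_slices: int) -> float:
    """Two-stage pipeline over equal slices; each slice pays the fixed overhead."""
    inter, intra = stage_times_s(payload_gb, tp_degree, cluster)
    inter_chunk = inter / num_slices + cluster.per_message_overhead_s
    intra_chunk = intra / num_slices + cluster.per_message_overhead_s
    # Fill the pipeline once, then the slower stage paces the remaining slices.
    return inter_chunk + intra_chunk + (num_slices - 1) * max(inter_chunk, intra_chunk)


def best_slice_count(payload_gb: float, tp_degree: int, cluster: Cluster,
                     candidates=(1, 2, 4, 8)) -> int:
    """Pick the slice count with the lowest modeled time."""
    return min(candidates,
               key=lambda k: pipelined_s(payload_gb, tp_degree, cluster, k))


if __name__ == "__main__":
    cluster = Cluster(inter_node_gb_per_s=25.0, intra_node_gb_per_s=200.0,
                      per_message_overhead_s=0.001)
    k = best_slice_count(1.0, 8, cluster)
    print("naive:", naive_alltoall_s(1.0, 8, cluster))
    print("pipelined:", pipelined_s(1.0, 8, cluster, k), "with slices =", k)
```

In this toy model, the gain from deduplication grows with the tensor-parallel degree when the intra-node link is much faster than the inter-node link, and the fixed per-transfer overhead limits how finely the payload can usefully be sliced. The numbers are illustrative only and do not reproduce the paper's measurements.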

MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization

TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training

FASTERMOE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization

Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

FastMoE: A Fast Mixture-of-Expert Training System

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping

ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes

Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism

MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference