MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization

Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li
2024-11-01
Abstract: The Mixture of Experts (MoE) is an advanced model architecture in the industry that combines multiple specialized expert models from various domains into a single supermodel. This approach enables the model to scale without significantly increasing the computational costs of training and inference, while maximizing model performance. However, current distributed training frameworks do not fully optimize communication, especially for large base models. This paper proposes a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume and on the training cluster's inter-node and intra-node network topologies. Compared to DeepSpeed, MoNTA achieves an 8x increase in AllToAll communication performance under 8-card tensor parallelism. Compared to the baseline, training a 2x70B model on 16 A800 cards with an 8K sequence length yields a 13% improvement in overall latency. Project Page: https://github.com/EnflameTechnology/DeepSpeed
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address the issue of communication efficiency in large-scale distributed training of Mixture of Experts (MoE) models. Specifically, current distributed training frameworks do not sufficiently optimize communication, especially when dealing with large foundational models, which affects overall computational efficiency. The paper proposes a network-traffic-aware parallel optimization method (MoNTA) aimed at improving communication efficiency by selecting the optimal parallel strategy, thereby accelerating the training of MoE models.

### Main Issues:

1. **Low Communication Efficiency**: Existing distributed training frameworks do not sufficiently optimize communication when handling MoE models. Particularly in large-scale models and multi-node environments, communication overhead becomes a bottleneck.
2. **Insufficient Resource Utilization**: Current methods do not fully utilize high-bandwidth communication resources between and within nodes, leading to low utilization of computational resources.
3. **Lack of Comprehensive Optimization**: Existing optimization methods do not comprehensively consider communication volume, communication efficiency, and network topology, resulting in limited optimization effects.

### Solutions:

1. **Network Traffic-Aware Parallel Optimization Method (MoNTA)**: This method selects the optimal parallel strategy based on communication volume and the network topology of the training cluster to improve communication efficiency.
2. **Data Redundancy Utilization**: By utilizing data redundancy in AllToAll communication under tensor parallelism, AllToAll communication is transformed into a combination of inter-node AllToAll and intra-node communication, thereby improving communication efficiency.
3. **Communication Data Slicing**: Based on the relationship between communication volume and communication efficiency, communication data is divided into different slices to ensure communication efficiency and achieve greater communication overlap.
4. **Utilization of High-Bandwidth Intra-Node Connections**: By fully utilizing high-bandwidth intra-node connections, communication efficiency is improved, thereby enhancing chip computational utilization.

### Experimental Results:

- Compared to the DeepSpeed baseline, MoNTA improves AllToAll communication performance by approximately 8 times under 8-card tensor parallelism.
- When training a 2x70B model using 16 A800 cards with a sequence length of 8K, overall latency performance is improved by 13%.

### Contributions:

1. Proposes a communication-aware parallel optimization method, MoNTA, which utilizes inter-node and intra-node communication resources to pipeline inter-node AllToAll with intra-node communication, and establishes a performance model of communication volume, communication efficiency, and parallel strategy to achieve MoE AllToAll communication overlap, improving computational utilization.
2. Introduces pipelining of intra-node communication and D2D replication, further reducing AllToAll overhead.
3. Analyzes communication conflict issues during the MoE model training process and provides a communication priority scheme.
4. Proposes an extension method for distributed parallel training clusters of long-context MoE models, generating distributed parallel extension strategies based on cluster resource parameters, model parameters, and context length.

Through these methods, MoNTA effectively addresses the communication efficiency issue in large-scale distributed training of MoE models, significantly enhancing training performance.
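
The idea of choosing a parallel strategy from communication volume and bandwidth can be illustrated with a toy cost model. The sketch below is not the paper's performance model: the `Cluster` type, the function names, the bandwidth figures, and the bandwidth-only cost formula (no latency term, no contention) are all assumptions made for illustration. It compares a flat AllToAll, where every tensor-parallel rank sends its full redundant copy across nodes, with a hierarchical scheme in which each rank sends only a `1/tp` share across nodes and an intra-node gather restores the full tensor, and it estimates the effect of slicing the data so the two stages overlap.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster description; bandwidths are per device, in GB/s."""
    inter_node_gbps: float
    intra_node_gbps: float


def flat_alltoall_time(nbytes: float, cluster: Cluster) -> float:
    """Every TP rank ships its full (redundant) copy over the inter-node link."""
    return nbytes / (cluster.inter_node_gbps * 1e9)


def stage_times(nbytes: float, tp: int, cluster: Cluster) -> tuple[float, float]:
    """Split the transfer into an inter-node stage and an intra-node stage.

    Each rank sends nbytes / tp across nodes, then receives the remaining
    (tp - 1) / tp of the data from its intra-node peers.
    """
    inter = (nbytes / tp) / (cluster.inter_node_gbps * 1e9)
    intra = (nbytes * (tp - 1) / tp) / (cluster.intra_node_gbps * 1e9)
    return inter, intra


def hierarchical_time(nbytes: float, tp: int, cluster: Cluster) -> float:
    """Run the two stages back to back, with no overlap."""
    inter, intra = stage_times(nbytes, tp, cluster)
    return inter + intra


def pipelined_time(nbytes: float, tp: int, cluster: Cluster, slices: int) -> float:
    """Two-stage pipeline over equal slices.

    Assumption: per-slice launch latency is ignored, so this model always
    favors more slices. In practice small messages use bandwidth poorly,
    which bounds how finely the data can be sliced.
    """
    inter, intra = stage_times(nbytes, tp, cluster)
    inter_s, intra_s = inter / slices, intra / slices
    return inter_s + intra_s + (slices - 1) * max(inter_s, intra_s)


def choose_strategy(nbytes: float, tp: int, cluster: Cluster) -> str:
    """Pick the cheaper of the flat and hierarchical schemes under this model."""
    flat = flat_alltoall_time(nbytes, cluster)
    hier = hierarchical_time(nbytes, tp, cluster)
    return "hierarchical" if hier < flat else "flat"


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    cluster = Cluster(inter_node_gbps=25.0, intra_node_gbps=200.0)
    size = 1e9  # bytes sent per device
    print(choose_strategy(size, tp=8, cluster=cluster))
    print(flat_alltoall_time(size, cluster))
    print(hierarchical_time(size, 8, cluster))
    print(pipelined_time(size, 8, cluster, slices=4))
```

Under these assumed numbers the hierarchical scheme wins whenever the tensor-parallel degree is greater than one and the intra-node link is faster than the inter-node link. A real model would add a latency term per message, which is what makes the slice size a genuine trade-off.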