Abstract: The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel training techniques that deploy thousands of GPUs to train a single model. Unfortunately, we have found that the efficiency of current parallel training is often suboptimal, largely due to two main issues. First, hardware failures are inevitable and interrupt training tasks. The inability to quickly identify the faulty components results in a substantial waste of GPU resources. Second, since GPUs must wait for parameter synchronization to complete before proceeding to the next round of computation, network congestion can greatly increase GPU waiting time. To address these challenges, this paper introduces a communication-driven solution named C4. The key insights of C4 are twofold. First, in parallel training, collective communication exhibits periodic and homogeneous characteristics, so any anomalies are certainly due to some form of hardware malfunction. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding the resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, which involves a few large flows, allows C4 to execute traffic planning efficiently and substantially reduce network congestion. C4 has been extensively implemented across our production systems, cutting error-induced overhead by roughly 30% and enhancing runtime performance by about 15% for certain applications with moderate communication costs.
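The first insight, that homogeneous collective communication makes outliers stand out, can be illustrated with a minimal sketch. This is not C4's actual detection logic, which the abstract does not specify; the function name `find_slow_ranks`, the `tolerance` threshold, and the sample timings are all hypothetical, chosen only to show how a rank that deviates from its peers might be flagged.

```python
from statistics import median


def find_slow_ranks(durations_by_rank, tolerance=1.5):
    """Return ranks whose typical collective time exceeds the group's.

    durations_by_rank maps a rank id to a list of measured durations
    (for example, milliseconds per collective call) over recent iterations.
    tolerance is a hypothetical threshold: a rank is flagged when its
    median duration is more than tolerance times the group median.
    """
    # Summarize each rank by its median so a single noisy sample is ignored.
    per_rank = {rank: median(samples) for rank, samples in durations_by_rank.items()}
    # Homogeneous workers should cluster around one value: the group median.
    group = median(per_rank.values())
    return sorted(rank for rank, t in per_rank.items() if t > tolerance * group)


# Made-up timings: ranks 0 to 2 behave alike, rank 3 is consistently slower.
timings = {
    0: [10.1, 10.0, 10.2],
    1: [10.0, 10.3, 10.1],
    2: [10.2, 10.1, 10.0],
    3: [25.0, 24.5, 26.1],
}
suspects = find_slow_ranks(timings)
print(suspects)
```

Using medians rather than means is a design choice in this sketch: it keeps one transient spike from marking a healthy rank as faulty, at the cost of reacting a little later to a genuine fault.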

Distributed Training Optimization for DCU

DISTRIBUTED HIGH-PERFORMANCE COMPUTING METHODS FOR ACCELERATING DEEP LEARNING TRAINING

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Distributed Training Large-Scale Deep Architectures

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

HetHub: A Heterogeneous Distributed Hybrid Training System for Large-Scale Models

TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism

Optimal distributed parallel algorithms for deep learning framework Tensorflow

Coded Parallelism for Distributed Deep Learning.

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

Improving Automatic Parallel Training Via Balanced Memory Workload Optimization

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training.

AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost

Proteus: Simulating the Performance of Distributed DNN Training

An Efficient 2D Method for Training Super-Large Deep Learning Models

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming