Abstract: Transformer-based deep neural network (DNN) models have shown considerable success across diverse tasks, prompting widespread adoption of distributed training methods such as data parallelism and pipeline parallelism. As parameter counts grow, hybrid parallel training becomes necessary to scale training, and communication overhead remains the primary bottleneck. Communication scheduling, which overlaps communication with computation, has proven beneficial for scaling. However, most existing work focuses on data parallelism and overlooks the nuances of hybrid parallel training. In this paper, we propose TriRace, an efficient communication scheduling framework for accelerating communication in hybrid parallel training that combines asynchronous pipeline parallelism with data parallelism. To achieve effective computation-communication overlap, TriRace introduces 3D communication scheduling, which exploits the data dependencies between communication and computation to schedule AllReduce communication, sparse communication, and peer-to-peer communication in hybrid parallel training. To avoid communication contention, TriRace also incorporates a topology-aware runtime that optimizes the execution of communication operations by taking into account ongoing communication operations and real-time network status. We have implemented a prototype of TriRace based on PyTorch and PipeDream-2BW and conducted comprehensive evaluations against three representative baselines. Experimental results show that TriRace achieves a 1.07–1.45× speedup over the state-of-the-art pipeline parallelism training baseline PipeDream-2BW and a 1.24–1.81× speedup over Megatron.
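The benefit of overlapping communication with computation can be illustrated with a toy timing model. The sketch below is not TriRace's scheduler; it assumes a single communication link, one gradient message per layer, and made-up per-layer durations. The function names and numbers are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def sequential_time(compute: Sequence[float], comm: Sequence[float]) -> float:
    """Iteration time when all communication waits until computation ends."""
    return sum(compute) + sum(comm)


def overlapped_time(compute: Sequence[float], comm: Sequence[float]) -> float:
    """Iteration time when each layer's gradient is sent as soon as it is ready.

    Assumptions (illustrative only): layers are computed back to back, and a
    single link carries one message at a time in layer order.
    """
    compute_done = 0.0  # time at which the current layer's gradient is ready
    link_free = 0.0     # time at which the link finishes its previous message
    for c, m in zip(compute, comm):
        compute_done += c
        # A message needs both its gradient and a free link before it starts.
        start = max(compute_done, link_free)
        link_free = start + m
    return max(compute_done, link_free)


# Hypothetical per-layer backward-compute and communication times (ms).
compute_ms = [4.0, 3.0, 2.0]
comm_ms = [2.0, 2.0, 3.0]

baseline = sequential_time(compute_ms, comm_ms)
overlapped = overlapped_time(compute_ms, comm_ms)
print(f"sequential: {baseline} ms, overlapped: {overlapped} ms")
```

In this model only the communication that cannot be hidden behind later computation adds to the iteration time, which is why the order and timing of communication operations matter once several kinds of traffic share the same links.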

Training Acceleration for Deep Neural Networks: A Hybrid Parallelization Strategy

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

PipePar: A Pipelined Hybrid Parallel Approach for Accelerating Distributed DNN Training

Model-Aware Parallelization Strategy for Deep Neural Networks' Distributed Training

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

A Memory-efficient Hybrid Parallel Framework for Deep Neural Network Training

AccTFM: an Effective Intra-Layer Model Parallelization Strategy for Training Large-Scale Transformer-Based Models

A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training

Accelerating Training For Distributed Deep Neural Networks In Mapreduce

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Adaptive Distributed Parallel Training Method for a Deep Learning Model Based on Dynamic Critical Paths of DAG

Enabling Energy-Efficient DNN Training on Hybrid GPU-FPGA Accelerators

Accelerating the Training of Artificial Neural Networks Using Data Parallelization

A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning

MP-DPS: Adaptive Distributed Training for Deep Learning Based on Node Merging and Path Prediction

Distributed High-Performance Computing Methods for Accelerating Deep Learning Training

A Practical Implementation of GPU based Accelerator for Deep Neural Networks

Distributed Training Optimization for DCU

Aware: Adaptive Distributed Training with Computation, Communication and Position Awareness for Deep Learning Model