Abstract: Neural network training requires a large amount of computation, so GPUs are often used for acceleration. While they improve performance, GPUs are underutilized during training. This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training. By exploiting the dependencies of gradient computations, ooo backprop reorders their execution to make the most of the GPU resources. We show that GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be improved in all three settings by applying ooo backprop and prioritizing critical operations. We propose three scheduling algorithms based on ooo backprop. For single-GPU training, we schedule with multi-stream out-of-order computation to mask the kernel launch overhead. In data-parallel training, we reorder the gradient computations to maximize the overlap of computation and parameter communication; in pipeline-parallel training, we prioritize critical gradient computations to reduce pipeline stalls. We evaluate our optimizations with twelve neural networks, including a light-weight computer vision model (MobileNet) and large NLP models (BERT and GPT-3), with up to forty-eight V100 GPUs. Our scheduling algorithms effectively improve the performance of single-GPU training as well as data- and pipeline-parallel training. Compared to the respective state-of-the-art training systems, the throughput is substantially improved for single-GPU, data-parallel, and pipeline-parallel training.
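The dependency structure that makes reordering possible can be illustrated with a toy example. The sketch below is not the paper's implementation. It assumes a chain of scalar layers y = w * x and assumes the reorderable work is the per-layer weight gradient, which no other layer's gradient depends on. The function names (`forward`, `backward_in_order`, `backward_out_of_order`) are hypothetical and chosen only for illustration.

```python
# Toy illustration (an assumption, not the paper's code): in a chain of
# scalar layers y = w * x, the gradient passed to the previous layer lies
# on the critical path, while each weight gradient can be deferred.

def forward(weights, x):
    """Return the input followed by every layer's output."""
    acts = [x]
    for w in weights:
        acts.append(w * acts[-1])
    return acts

def backward_in_order(weights, acts, grad_out=1.0):
    """Conventional backprop: weight gradient and input gradient per layer, last to first."""
    grads_w = [0.0] * len(weights)
    g = grad_out
    for i in reversed(range(len(weights))):
        grads_w[i] = g * acts[i]   # weight gradient of layer i
        g = g * weights[i]         # gradient passed to layer i - 1
    return grads_w

def backward_out_of_order(weights, acts, grad_out=1.0):
    """Run the critical path first, then the deferred weight gradients in another order."""
    upstream = [0.0] * len(weights)
    g = grad_out
    for i in reversed(range(len(weights))):
        upstream[i] = g            # gradient arriving at layer i's output
        g = g * weights[i]         # critical path: needed by layer i - 1
    grads_w = [0.0] * len(weights)
    for i in range(len(weights)):  # deferred work, here first layer first
        grads_w[i] = upstream[i] * acts[i]
    return grads_w

weights = [2.0, 3.0, 4.0]
acts = forward(weights, 1.5)
print(backward_in_order(weights, acts))
print(backward_out_of_order(weights, acts))
```

Both functions return the same gradients. The second only changes when each weight gradient is computed, which is the freedom a scheduler can use to overlap work with communication or with other kernels.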

Prophet: Speeding Up Distributed DNN Training with Predictable Communication Scheduling

US-Byte: an Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training

Proteus: Simulating the Performance of Distributed DNN Training

Accelerating Distributed DNN Training via Transport Layer Scheduling

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

A generic communication scheduler for distributed DNN training acceleration

MG-WFBP: Merging Gradients Wisely for Efficient Communication in Distributed Deep Learning

An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

Scheduling Optimization Techniques for Neural Network Training

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

From promise to practice: realizing high-performance decentralized training

Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines

Mercury: A Simple Transport Layer Scheduler to Accelerate Distributed DNN Training