Scheduling Optimization Techniques for Neural Network Training

Hyungjun Oh, HyeongJu Kim, Jiwon Seo
DOI: https://doi.org/10.48550/arXiv.2110.00929
2021-10-03
Abstract: Neural network training requires a large amount of computation, so GPUs are often used for acceleration. While they improve performance, GPUs are underutilized during training. This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training. By exploiting the dependencies of gradient computations, ooo backprop makes it possible to reorder their execution to make the most of the GPU resources. We show that GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can all be improved by applying ooo backprop and prioritizing critical operations. We propose three scheduling algorithms based on ooo backprop. For single-GPU training, we schedule with multi-stream out-of-order computation to mask the kernel launch overhead. In data-parallel training, we reorder the gradient computations to maximize the overlap of computation and parameter communication; in pipeline-parallel training, we prioritize critical gradient computations to reduce pipeline stalls. We evaluate our optimizations with twelve neural networks, including a lightweight computer vision model (MobileNet) and large NLP models (BERT and GPT-3), with up to forty-eight V100 GPUs. Our scheduling algorithms effectively improve the performance of single-GPU training as well as data- and pipeline-parallel training. Compared to the respective state-of-the-art training systems, the throughput is substantially improved for single-GPU, data-parallel, and pipeline-parallel training.
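The dependency property the abstract relies on can be illustrated with a toy model. The sketch below is not the paper's scheduler: it assumes a plain chain of layers, and all function names are hypothetical. It encodes only the standard backpropagation fact that the input gradient of a layer (labeled dX here) feeds the layer before it, while the weight gradient (dW) feeds no other backward operation and can therefore be moved later in the schedule.

```python
# Toy model of backward-pass scheduling for a chain of layers 1..num_layers.
# All names are hypothetical; this is an illustration, not the paper's algorithm.

def conventional_order(num_layers):
    """Layer-by-layer order: for each layer, last to first, dX then dW."""
    order = []
    for layer in range(num_layers, 0, -1):
        order.append(("dX", layer))
        order.append(("dW", layer))
    return order


def deferred_weight_grad_order(num_layers):
    """One possible out-of-order schedule: run the whole dX chain first,
    then all weight gradients."""
    dx_ops = [("dX", layer) for layer in range(num_layers, 0, -1)]
    dw_ops = [("dW", layer) for layer in range(num_layers, 0, -1)]
    return dx_ops + dw_ops


def is_valid(order, num_layers):
    """Check that a schedule respects backprop dependencies.

    Both dX and dW of layer l need the gradient arriving from layer l+1,
    which is produced by ("dX", l+1). The last layer reads the loss
    gradient directly. Nothing depends on a dW operation.
    """
    expected = {(kind, layer)
                for kind in ("dX", "dW")
                for layer in range(1, num_layers + 1)}
    if len(order) != len(expected) or set(order) != expected:
        return False  # every operation must appear exactly once
    done = set()
    for kind, layer in order:
        if layer < num_layers and ("dX", layer + 1) not in done:
            return False
        done.add((kind, layer))
    return True


if __name__ == "__main__":
    for name, schedule in [("conventional", conventional_order(3)),
                           ("deferred dW", deferred_weight_grad_order(3))]:
        print(name, schedule, "valid:", is_valid(schedule, 3))
```

Both schedules pass the dependency check, which is what leaves a scheduler free to choose among them, for example to overlap weight-gradient work with communication or with other kernels.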
Machine Learning; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?