Abstract: Deep neural networks (DNNs) have been widely applied in many fields of artificial intelligence (AI), gaining great popularity in both industry and academia. Increasing the size of DNN models dramatically improves learning accuracy. However, training large-scale DNN models on a single GPU takes an unacceptably long time. To speed up the training process, many distributed deep learning (DL) systems and frameworks have been designed and published for parallel DNN training with multiple GPUs. However, most existing studies concentrate only on improving the training speed of a single DNN model under centralized or decentralized systems with synchronous or asynchronous approaches. Few works consider multi-DNN training on a GPU cluster, which is a joint optimization problem of job scheduling and resource allocation. This paper proposes an optimizing makespan and resource utilization (OMRU) approach to minimize job completion time and improve resource utilization for multi-DNN training in a GPU cluster. Specifically, we first collect the training speed/time data of all DNN models by running each job for one epoch on different numbers of GPUs. The OMRU algorithm, which integrates job scheduling, resource allocation, and GPU reuse strategies, is then devised to minimize the total job completion time (also called the makespan) and improve GPU cluster resource utilization. The linear scaling rule (LSR) is adopted to adjust the learning rate when a DNN model is trained on multiple GPUs with a large minibatch size, which preserves model accuracy without tuning the other hyper-parameters. We implement the OMRU algorithm in PyTorch with the Ring-Allreduce communication architecture on a GPU cluster with 8 nodes, each of which has 4 NVIDIA V100 GPUs. Experimental results on image classification and action recognition show that OMRU achieves a makespan reduction of up to 30% compared to the baseline scheduling algorithms and reaches an average resource utilization of 98.4% on image classification and 99.2% on action recognition, while maintaining state-of-the-art model accuracy.
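
To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. It shows the linear scaling rule in its usual form (the learning rate grows in proportion to the global minibatch size) and one common way to compute makespan and GPU utilization for a finished schedule. The function names, the `(start, duration, gpus)` job tuples, and the utilization formula (busy GPU-time divided by total GPU-time up to the makespan) are assumptions made for illustration; the paper may define utilization differently.

```python
def scaled_learning_rate(base_lr, base_batch, per_gpu_batch, num_gpus):
    """Linear scaling rule: scale the learning rate by the same factor
    as the global minibatch size (per-GPU batch times GPU count)."""
    global_batch = per_gpu_batch * num_gpus
    return base_lr * global_batch / base_batch


def makespan(jobs):
    """Completion time of the last job.

    Each job is a hypothetical (start, duration, gpus) tuple."""
    return max(start + duration for start, duration, _ in jobs)


def utilization(jobs, total_gpus):
    """Assumed definition: busy GPU-time over available GPU-time
    from time 0 until the makespan."""
    busy = sum(duration * gpus for _, duration, gpus in jobs)
    return busy / (total_gpus * makespan(jobs))


# Illustrative schedule on an 8-GPU cluster (made-up numbers).
jobs = [(0, 10, 4), (0, 6, 4), (6, 4, 4)]
lr = scaled_learning_rate(base_lr=0.1, base_batch=256, per_gpu_batch=64, num_gpus=8)
print(lr, makespan(jobs), utilization(jobs, total_gpus=8))
```

In this made-up schedule, the third job reuses the four GPUs released by the second job at time 6, so no GPU sits idle before the makespan. This is the kind of gap that a GPU reuse strategy aims to close.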

Scheduling Optimization Techniques for Neural Network Training

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

Hybrid Quantum-Classical Scheduling for Accelerating Neural Network Training with Newton's Gradient Descent

Dynamic Space-Time Scheduling for GPU Inference

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Spatial Sharing of GPU for Autotuning DNN models

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms

Stage-based Hyper-parameter Optimization for Deep Learning

Prophet: Speeding Up Distributed DNN Training with Predictable Communication Scheduling

A Unified CPU-GPU Protocol for GNN Training

Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Optimization of GPU Memory Usage for Training Deep Neural Networks

An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches

An optimal scheduling architecture for accelerating batch algorithms on Neural Network processor architectures

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Optimizing Makespan and Resource Utilization for Multi-DNN Training in GPU Cluster

DNN Training Acceleration Via Exploring GPGPU Friendly Sparsity

BLAD: Adaptive Load Balanced Scheduling and Operator Overlap Pipeline for Accelerating the Dynamic GNN Training