Abstract: Deep neural networks (DNNs) have been widely applied in many fields of artificial intelligence (AI) and have gained great popularity in both industry and academia. Increasing the size of DNN models can dramatically improve learning accuracy. However, training large-scale DNN models on a single GPU takes an unacceptably long time. To speed up training, many distributed deep learning (DL) systems and frameworks have been designed and published for parallel DNN training on multiple GPUs. However, most existing studies concentrate only on improving the training speed of a single DNN model under centralized or decentralized systems with synchronous or asynchronous approaches. Few works consider multi-DNN training on a GPU cluster, which is a joint optimization problem of job scheduling and resource allocation. This paper proposes an optimizing makespan and resource utilization (OMRU) approach to minimize job completion time and improve resource utilization for multi-DNN training in a GPU cluster. Specifically, we first collect the training speed/time data of all DNN models by running each job for one epoch on different numbers of GPUs. The OMRU algorithm, which integrates job scheduling, resource allocation, and GPU reuse strategies, is then devised to minimize the total job completion time (also called the makespan) and improve GPU cluster resource utilization. The linear scaling rule (LSR) is adopted to adjust the learning rate when a DNN model is trained on multiple GPUs with a large minibatch size, which can guarantee model accuracy without tuning the other hyper-parameters. We implement the OMRU algorithm on PyTorch with the Ring-Allreduce communication architecture, on a GPU cluster with 8 nodes, each of which has 4 NVIDIA V100 GPUs. Experimental results on image classification and action recognition show that OMRU achieves a makespan reduction of up to 30% compared to the baseline scheduling algorithms and reaches an average resource utilization of 98.4% and 99.2% on image classification and action recognition, respectively, with state-of-the-art model accuracy.
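The linear scaling rule mentioned in the abstract can be illustrated with a minimal sketch: when the global minibatch grows by a factor k (for example, because the same per-GPU batch is replicated across k GPUs), the learning rate is multiplied by k as well. The function name `scaled_learning_rate`, the reference learning rate of 0.1, and the reference batch size of 256 are illustrative assumptions and are not taken from the paper.

```python
def scaled_learning_rate(base_lr: float, base_batch_size: int,
                         per_gpu_batch_size: int, num_gpus: int) -> float:
    """Linear scaling rule: scale the learning rate by the same factor
    as the global minibatch size.

    base_lr and base_batch_size describe a reference single-GPU setup
    (hypothetical values, chosen by the user).
    """
    if min(base_batch_size, per_gpu_batch_size, num_gpus) <= 0:
        raise ValueError("batch sizes and GPU count must be positive")
    # With data parallelism, each GPU processes its own minibatch,
    # so the global minibatch is the per-GPU batch times the GPU count.
    global_batch_size = per_gpu_batch_size * num_gpus
    return base_lr * global_batch_size / base_batch_size


if __name__ == "__main__":
    # Assumed reference setup: lr 0.1 at batch size 256 on one GPU.
    for gpus in (1, 2, 4, 8):
        print(gpus, scaled_learning_rate(0.1, 256, 256, gpus))
```

A scheduler that changes the number of GPUs assigned to a job could call such a helper each time the allocation changes, so that only the learning rate is adjusted and the remaining hyper-parameters stay fixed.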
