Abstract: Deep neural networks (DNNs) have been widely applied in many fields of artificial intelligence (AI) and have gained great popularity in both industry and academia. Increasing the size of DNN models dramatically improves learning accuracy. However, training large-scale DNN models on a single GPU takes an unacceptably long time. To speed up training, many distributed deep learning (DL) systems and frameworks have been designed and published for parallel DNN training on multiple GPUs. Most existing studies, however, concentrate only on improving the training speed of a single DNN model under centralized or decentralized systems with synchronous or asynchronous approaches. Few works consider multi-DNN training on a GPU cluster, which is a joint optimization problem of job scheduling and resource allocation. This paper proposes an optimizing makespan and resource utilization (OMRU) approach to minimize job completion time and improve resource utilization for multi-DNN training in a GPU cluster. Specifically, we first collect training speed/time data for all DNN models by running each job for one epoch on different numbers of GPUs. The OMRU algorithm, which integrates job scheduling, resource allocation, and GPU reuse strategies, is then devised to minimize the total job completion time (also called the makespan) and improve GPU cluster resource utilization. The linear scaling rule (LSR) is adopted to adjust the learning rate when a DNN model is trained on multiple GPUs with a large minibatch size, which can guarantee model accuracy without tuning the other hyperparameters. We implement the OMRU algorithm on PyTorch with the Ring-Allreduce communication architecture on a GPU cluster with 8 nodes, each of which has 4 NVIDIA V100 GPUs. Experimental results on image classification and action recognition show that OMRU achieves a makespan reduction of up to 30% compared to the baseline scheduling algorithms and reaches an average resource utilization of 98.4% and 99.2% on image classification and action recognition, respectively, with state-of-the-art model accuracy.
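
As a rough illustration of the linear scaling rule mentioned in the abstract, the sketch below scales a reference learning rate in proportion to the global minibatch size. The function name, the reference values (learning rate 0.1 at minibatch size 256), and the per-GPU batch size of 32 are illustrative assumptions, not values taken from the paper. Learning-rate warmup, which is often paired with the rule, is omitted.

```python
def scale_learning_rate(base_lr: float,
                        base_batch_size: int,
                        num_gpus: int,
                        per_gpu_batch_size: int) -> float:
    """Linear scaling rule: if the global minibatch grows by a factor k,
    multiply the learning rate by the same factor k.

    All argument names are hypothetical and chosen for this sketch only.
    """
    if base_lr <= 0 or base_batch_size <= 0:
        raise ValueError("base_lr and base_batch_size must be positive")
    if num_gpus <= 0 or per_gpu_batch_size <= 0:
        raise ValueError("num_gpus and per_gpu_batch_size must be positive")

    # Global minibatch size under data-parallel training.
    global_batch_size = num_gpus * per_gpu_batch_size
    # Scale factor k relative to the reference configuration.
    k = global_batch_size / base_batch_size
    return base_lr * k


if __name__ == "__main__":
    # Assumed reference setting: lr = 0.1 at a minibatch size of 256.
    # GPU counts up to 32 match the 8-node x 4-GPU cluster in the abstract.
    for gpus in (1, 4, 8, 16, 32):
        lr = scale_learning_rate(0.1, 256, gpus, 32)
        print(f"{gpus:2d} GPUs -> global batch {gpus * 32:4d}, lr = {lr:.4f}")
```

Because only the learning rate changes with the GPU count, a scheduler can in principle reassign a job to a different number of GPUs without revisiting the remaining hyperparameters, which is the property the abstract relies on.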

Rationing Bandwidth Resources for Mitigating Network Resource Contention in Distributed DNN Training Clusters

Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud Clusters

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

A Sum-of-Ratios Multi-Dimensional-Knapsack Decomposition for DNN Resource Scheduling

A Novel Co-design Peta-scale Heterogeneous Cluster for Deep Learning Training

Isolated Scheduling for Distributed Training Tasks in GPU Clusters

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Benchmarking Resource Usage for Efficient Distributed Deep Learning

MLTCP: Congestion Control for DNN Training

DDPQN: An Efficient DNN Offloading Strategy in Local-Edge-Cloud Collaborative Environments

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

Accelerating Distributed DNN Training via Transport Layer Scheduling

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes

Energy-Aware Workload Allocation for Distributed Deep Neural Networks in Edge-Cloud Continuum

Optimizing Makespan and Resource Utilization for Multi-DNN Training in GPU Cluster

EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud