Abstract: Deep neural networks (DNNs) have been widely applied in many fields of artificial intelligence (AI), gaining great popularity in both industry and academia. Increasing the size of DNN models dramatically improves learning accuracy. However, training large-scale DNN models on a single GPU takes an unacceptably long time. To speed up training, many distributed deep learning (DL) systems and frameworks have been designed and published for parallel DNN training on multiple GPUs. Most existing studies, however, concentrate only on improving the training speed of a single DNN model under centralized or decentralized systems with synchronous or asynchronous approaches. Few works consider multi-DNN training on a GPU cluster, which is a joint optimization problem of job scheduling and resource allocation. This paper proposes an optimizing makespan and resource utilization (OMRU) approach to minimize job completion time and improve resource utilization for multi-DNN training in a GPU cluster. Specifically, we first collect the training speed/time data of all DNN models by running each job for one epoch on different numbers of GPUs. The OMRU algorithm, which integrates job scheduling, resource allocation, and GPU reuse strategies, is then devised to minimize the total job completion time (also called the makespan) and improve the resource utilization of the GPU cluster. The linear scaling rule (LSR) is adopted to adjust the learning rate when a DNN model is trained on multiple GPUs with a large minibatch size, which can preserve model accuracy without tuning the other hyperparameters. We implement the OMRU algorithm on PyTorch with the Ring-Allreduce communication architecture, on a GPU cluster with 8 nodes, each of which has 4 NVIDIA V100 GPUs. Experimental results on image classification and action recognition show that OMRU achieves a makespan reduction of up to 30% compared to the baseline scheduling algorithms and reaches an average resource utilization of 98.4% on image classification and 99.2% on action recognition, while maintaining state-of-the-art model accuracy.
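
The linear scaling rule mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which the learning rate grows in proportion to the global minibatch size; the function name, its parameters, and the reference values below are hypothetical and are not taken from the paper's implementation.

```python
def scaled_learning_rate(base_lr: float,
                         base_batch_size: int,
                         num_gpus: int,
                         per_gpu_batch_size: int) -> float:
    """Scale the learning rate linearly with the global minibatch size.

    Hypothetical helper: base_lr is assumed to be the rate tuned for
    base_batch_size samples per step on the reference configuration.
    """
    if base_batch_size <= 0 or num_gpus <= 0 or per_gpu_batch_size <= 0:
        raise ValueError("batch sizes and GPU count must be positive")
    # In data-parallel training, each step processes one minibatch per GPU.
    global_batch_size = num_gpus * per_gpu_batch_size
    # Linear scaling: k times the batch size gives k times the learning rate.
    return base_lr * global_batch_size / base_batch_size


# Illustrative values only: a rate tuned for 256 samples on one GPU,
# reused when the same job is allocated 4 GPUs with 256 samples each.
lr_on_four_gpus = scaled_learning_rate(0.1, 256, 4, 256)
```

Under this assumption, a scheduler that changes the number of GPUs given to a job only needs to recompute this one value rather than retune the remaining hyperparameters.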

pommDNN: Performance optimal GPU memory management for deep neural network training

Accelerating Neural Network Training with Processing-in-Memory GPU

TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training Via Tensor Splitting

MegTaiChi: Dynamic Tensor-based Memory Management Optimization for DNN Training

STR: Hybrid Tensor Re-Generation to Break Memory Wall for DNN Training

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

An Application-oblivious Memory Scheduling System for DNN Accelerators

Efficient Memory Management for GPU-based Deep Learning Systems

A Swap Dominated Tensor Re-Generation Strategy for Training Deep Learning Models

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Enabling Large Batch Size Training for DNN Models Beyond the Memory Limit While Maintaining Performance

Accelerating Tensor Swapping in GPUs with Self-Tuning Compression

Optimizing Makespan and Resource Utilization for Multi-Dnn Training in GPU Cluster

G-NMP: Accelerating Graph Neural Networks with DIMM-based Near-Memory Processing

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design

HOME: A Holistic GPU Memory Management Framework for Deep Learning

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

Optimization of GPU Memory Usage for Training Deep Neural Networks

Pinpointing the Memory Behaviors of DNN Training

SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks