Abstract: With the widespread use of GPUs for deep learning applications, the efficient execution of multiple deep learning jobs in a GPU cluster has attracted great attention. Because modern GPUs support concurrent execution of multiple jobs, efficient workload parallelization becomes more difficult. Traditional coarse-grained scheduling methods account for neither the interference caused by resource contention among co-executing jobs nor the characteristics of deep learning jobs, which can lead to unbalanced use of computing resources and, in turn, degrade job performance in the GPU cluster. In this paper, we propose a two-stage workload parallelization approach for deep learning training workloads. We first propose two interference-aware prediction models: the Interference-Aware Similarity Prediction (IASP) model, based on deep collaborative filtering, and the Interference-Aware Performance Prediction (IAPP) model, based on a deep neural network. Our parallelization approach comprises a cluster-level and a node-level workload parallelization strategy. Specifically, the Cluster-Level Workload Parallelization (CLWP) strategy assigns deep learning jobs to appropriate worker nodes according to the proposed IASP model, and the Node-Level Workload Parallelization (NLWP) strategy places deep learning tasks on appropriate GPUs according to the proposed IAPP model and the communication costs among tasks. We evaluate our deep learning workload parallelization strategy on a prototype platform against other widely used methods. The experimental results show that the proposed strategy improves GPU utilization by 18% on average and reduces job completion time by around 22%.
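
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names `pick_node` and `place_tasks` are hypothetical, the `interference` and `slowdown` callables are stand-ins for what IASP-style and IAPP-style predictors might return, and the greedy selection rule is an assumption made only for illustration.

```python
from typing import Callable, Dict, List, Sequence


def pick_node(job: str,
              nodes: Dict[str, List[str]],
              interference: Callable[[str, str], float]) -> str:
    """Cluster-level step (assumed greedy rule): choose the node whose
    resident jobs have the lowest total predicted interference with `job`."""
    def node_cost(name: str) -> float:
        return sum(interference(job, other) for other in nodes[name])
    # Sort names so that ties are broken deterministically.
    return min(sorted(nodes), key=node_cost)


def place_tasks(tasks: Sequence[str],
                gpus: Sequence[int],
                slowdown: Callable[[str, List[str]], float],
                comm_cost: Callable[[int, int], float]) -> Dict[str, int]:
    """Node-level step (assumed greedy rule): place each task on the GPU
    that minimizes predicted slowdown plus the communication cost to the
    tasks of the same job that are already placed."""
    resident: Dict[int, List[str]] = {g: [] for g in gpus}
    placement: Dict[str, int] = {}
    for task in tasks:
        def cost(g: int) -> float:
            comm = sum(comm_cost(g, placement[t]) for t in placement)
            return slowdown(task, resident[g]) + comm
        best = min(gpus, key=cost)
        placement[task] = best
        resident[best].append(task)
    return placement


# Toy predictors, purely hypothetical values.
nodes = {"n1": ["a"], "n2": ["b"]}
chosen = pick_node("j", nodes, lambda x, y: 0.9 if y == "a" else 0.1)
plan = place_tasks(["t0", "t1"], [0, 1],
                   slowdown=lambda task, res: 1.0 * len(res),
                   comm_cost=lambda g, h: 0.0 if g == h else 0.5)
print(chosen, plan)
```

In this toy setup, raising the cross-GPU communication cost above the co-location slowdown makes the node-level step pack both tasks onto one GPU, which is the trade-off the abstract attributes to the NLWP strategy.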

Hybrid Parameter Update: Alleviating Imbalance Impacts for Distributed Deep Learning

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

SAP-SGD: Accelerating Distributed Parallel Training with High Communication Efficiency on Heterogeneous Clusters

Petrel: Community-Aware Synchronous Parallel For Heterogeneous Parameter Server

Dynamic Delay Based Cyclic Gradient Update Method for Distributed Training

Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training

Petrel: Heterogeneity-Aware Distributed Deep Learning Via Hybrid Synchronization

Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates

Accelerating Distributed Training in Heterogeneous Clusters via a Straggler-Aware Parameter Server

Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs over Heterogeneous Infrastructure

Online Job Scheduling in Distributed Machine Learning Clusters

Efficient Communication Scheduling for Parameter Synchronization of DML in Data Center Networks

Heterogeneity-Aware Distributed Parameter Servers

Interference-aware parallelization for deep learning workload in GPU cluster

Adaptive Load Balancing for Parameter Servers in Distributed Machine Learning over Heterogeneous Networks

FedGSync: Jointly Optimized Weak Synchronization and Gradient Transmission for Fast Distributed Machine Learning in Heterogeneous WAN

EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

A scalable and topology configurable protocol for distributed parameter synchronization

Grouper: Accelerating Hyperparameter Searching in Deep Learning Clusters with Network Scheduling