Abstract: Deep learning is an increasingly used technique for solving complex artificial intelligence (AI) problems. As datasets grow and models become more complex, large-scale deep learning has become a significant challenge. For training large-scale models, the cloud can serve as a distributed high-performance computing (HPC) platform with benefits in cost and flexibility. However, one of the major performance barriers for distributed deep learning in a distributed HPC environment is the network: performance is often limited by heavy traffic, such as the many gradient transfers that distributed stochastic gradient descent requires. Many network studies in distributed deep learning address these problems, but most focus on improving communication performance by applying new methods or algorithms, such as overlapping parameter synchronization to minimize communication delay, rather than considering the actual network. In this paper, we focus on the actual network, especially in a distributed HPC environment. In such an environment, cluster nodes are grouped into zones/regions, each containing an appropriate number of distributed HPC nodes; if the nodes assigned to a distributed deep learning task lie in different zones/regions, performance may degrade due to network delay. The proposed network optimization algorithm places distributed work in the same zone as far as possible to reduce network delay. Furthermore, nodes are scored on metrics collected by network monitoring tools, such as loss, delay, and throughput, to select the optimal node within the zone. Our proposal has been validated on the Kubernetes platform, an open-source orchestrator for the automatic management and deployment of microservices. The proposed scheduler improves the performance of distributed deep learning.
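
The two ideas in the abstract, keeping a job inside one zone and scoring nodes by loss, delay, and throughput, can be illustrated with a minimal Python sketch. This is not the paper's scheduler: the abstract gives no scoring formula, so the `Node` fields, the weights, the normalization, and the fallback for jobs that fit in no single zone are all assumptions made for illustration. It also does not use the Kubernetes API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical node record; field names are assumptions, not from the paper."""
    name: str
    zone: str
    loss_pct: float         # measured packet loss, in percent
    delay_ms: float         # measured network delay, in milliseconds
    throughput_mbps: float  # measured throughput, in Mbit/s


def score(node, max_delay_ms, max_throughput_mbps, weights=(0.3, 0.3, 0.4)):
    """Return a score where higher is better. The weights are illustrative only."""
    w_loss, w_delay, w_tp = weights
    loss_term = 1.0 - min(node.loss_pct, 100.0) / 100.0
    delay_term = 1.0 - node.delay_ms / max_delay_ms if max_delay_ms > 0 else 1.0
    tp_term = (node.throughput_mbps / max_throughput_mbps
               if max_throughput_mbps > 0 else 0.0)
    return w_loss * loss_term + w_delay * delay_term + w_tp * tp_term


def place_workers(nodes, num_workers):
    """Choose num_workers nodes, preferring a single zone, then the best scores."""
    if num_workers <= 0 or num_workers > len(nodes):
        raise ValueError("num_workers must be between 1 and the number of nodes")

    # Normalize delay and throughput against the cluster-wide maxima.
    max_delay = max(n.delay_ms for n in nodes)
    max_tp = max(n.throughput_mbps for n in nodes)

    def node_score(n):
        return score(n, max_delay, max_tp)

    def ranked(group):
        return sorted(group, key=node_score, reverse=True)

    by_zone = defaultdict(list)
    for n in nodes:
        by_zone[n.zone].append(n)

    # Zones large enough to host the whole job: take each zone's best nodes
    # and keep the zone whose selection has the highest total score.
    candidates = [ranked(g)[:num_workers]
                  for g in by_zone.values() if len(g) >= num_workers]
    if candidates:
        return max(candidates, key=lambda g: sum(node_score(n) for n in g))

    # Assumed fallback: no zone fits, so fill from the largest zones first
    # to span as few zones as possible.
    chosen = []
    for group in sorted(by_zone.values(), key=len, reverse=True):
        chosen.extend(ranked(group)[:num_workers - len(chosen)])
        if len(chosen) == num_workers:
            break
    return chosen
```

In this sketch, zone affinity takes priority over the score: a lower-scoring node in the chosen zone is preferred to a higher-scoring node elsewhere, which reflects the abstract's emphasis on avoiding cross-zone delay. A real deployment would feed the metrics from a monitoring system and tune the weights empirically.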
