Abstract: Deep learning is an increasingly used technique for solving complex artificial intelligence (AI) problems. As datasets grow and deep learning models become more complex, training at large scale has become a significant challenge. For training large-scale models, the cloud can serve as a distributed high-performance computing (HPC) platform with benefits in cost and flexibility. However, one of the major performance barriers for distributed deep learning in a distributed HPC environment is the network. Performance is often limited by heavy traffic, such as the many stochastic gradient descent transfers required for distributed communication. Many network studies in distributed deep learning address these problems, but most focus only on improving communication performance by applying new methods or algorithms, such as overlapping parameter synchronization, to minimize communication delay rather than considering the actual network. In this paper, we focus on the actual network, especially in a distributed HPC environment. In such an environment, cluster nodes are grouped into zones/regions, each a set of an appropriate number of distributed HPC nodes; if the nodes performing a distributed deep learning task are assigned to different zones/regions, performance may degrade due to network delay. The proposed network optimization algorithm places distributed work in the same zone as far as possible to reduce network delay. Furthermore, nodes are scored on metrics collected by network monitoring tools, such as loss, delay, and throughput, to select the optimal node within the zone. Our proposal has been validated on Kubernetes, an open source orchestrator for the automatic deployment and management of microservices. The proposed scheduler improves the performance of distributed deep learning.
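
The two ideas in the abstract, keeping workers in one zone and scoring nodes on loss, delay, and throughput, can be illustrated with a minimal Python sketch. It is not the paper's scheduler and does not use the Kubernetes API. The names `NodeMetrics`, `score`, and `place`, the weights, and the normalization caps are all hypothetical choices made for illustration; the abstract does not specify how the metrics are combined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeMetrics:
    """Hypothetical per-node measurements from a network monitoring tool."""
    name: str
    zone: str
    loss_pct: float          # packet loss in percent (lower is better)
    delay_ms: float          # network delay in milliseconds (lower is better)
    throughput_mbps: float   # measured throughput in Mbit/s (higher is better)


def score(node, max_delay_ms=100.0, max_throughput_mbps=10_000.0,
          weights=(0.3, 0.3, 0.4)):
    """Combine the three metrics into one value in [0, 1].

    The caps and weights are assumptions for this sketch, not values
    taken from the paper.
    """
    w_loss, w_delay, w_tput = weights
    loss_term = 1.0 - min(node.loss_pct / 100.0, 1.0)
    delay_term = 1.0 - min(node.delay_ms / max_delay_ms, 1.0)
    tput_term = min(node.throughput_mbps / max_throughput_mbps, 1.0)
    return w_loss * loss_term + w_delay * delay_term + w_tput * tput_term


def place(nodes, workers):
    """Pick `workers` nodes, keeping them in a single zone when possible."""
    by_zone = defaultdict(list)
    for node in nodes:
        by_zone[node.zone].append(node)
    # Within each zone, best-scoring nodes come first.
    for zone_nodes in by_zone.values():
        zone_nodes.sort(key=score, reverse=True)

    # Preferred case: some zone can host every worker. Among such zones,
    # choose the one whose top `workers` nodes have the highest total score.
    fitting = [z for z in by_zone.values() if len(z) >= workers]
    if fitting:
        best = max(fitting, key=lambda z: sum(score(n) for n in z[:workers]))
        return best[:workers]

    # Fallback: no single zone is large enough, so fill from the largest
    # zones first to keep as many workers as possible together.
    chosen = []
    for zone_nodes in sorted(by_zone.values(), key=len, reverse=True):
        chosen.extend(zone_nodes[: workers - len(chosen)])
        if len(chosen) == workers:
            break
    return chosen  # may hold fewer than `workers` nodes if the cluster is too small
```

Calling `place(nodes, workers=2)` on a list of `NodeMetrics` returns the selected nodes in descending score order within their zone. In a real Kubernetes deployment, a comparable policy would presumably be expressed through the scheduler's own filtering and scoring extension points rather than a standalone function like this one.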

Dynamic Distribution Strategy of Distributed Tasks Based on Limited Synchronous Parallel

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

An Adaptive Load Balancing Strategy for Distributed Machine Learning

DLB: A Dynamic Load Balance Strategy for Distributed Training of Deep Neural Networks

A Load-Balancing Strategy Based on Multi-Task Learning in a Distributed Training Environment

Accelerating Distributed Learning in Non-Dedicated Environments

An Adaptive Synchronous Parallel Strategy for Distributed Machine Learning

Semi-Dynamic Load Balancing: Efficient Distributed Learning in Non-Dedicated Environments

A Work-Stealing Based Dynamic Load Balancing Algorithm for Conservative Parallel Discrete Event Simulation

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

DLBS: Decentralized load balancing scheme for event-driven cloud frameworks

Load scheduling for distributed edge computing: A communication-computation tradeoff

Distributed dynamic load balancing for task parallel programming

A Load Balancing Strategy for Large-Scale Distributed Computing

Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs over Heterogeneous Infrastructure

Fast Distributed Deep Learning Via Worker-adaptive Batch Sizing

Tasks Distribution Algorithm of Dynamic Scalable Cluster System

An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

DeMS: A Hybrid Scheme of Task Scheduling and Load Balancing in Computing Clusters

Online Job Scheduling in Distributed Machine Learning Clusters

DistSim: A Performance Model of Large-Scale Hybrid Distributed DNN Training