Abstract: Deep learning is an increasingly popular technique for solving complex artificial intelligence (AI) problems. As datasets grow and models become more complex, training deep learning models at large scale has become a significant challenge. For training large-scale models, the cloud can serve as a distributed HPC (high-performance computing) platform with benefits in cost and flexibility. However, the network is one of the major performance barriers for distributed deep learning in a distributed HPC environment. Performance is often limited by heavy traffic, such as the many gradient transfers that distributed stochastic gradient descent requires. Many studies address networking in distributed deep learning, but most focus only on improving communication performance by applying new methods or algorithms, such as overlapping parameter synchronization to minimize communication delay, rather than considering the actual network. In this paper, we focus on the actual network, especially in a distributed HPC environment. In such an environment, nodes are grouped into zones/regions, each containing a set of distributed HPC nodes; if the cluster nodes for a distributed deep learning task are assigned to different zones/regions, performance may degrade due to network delay. The proposed network optimization algorithm places distributed work in the same zone as far as possible to reduce network delay. Furthermore, nodes are scored on metrics such as loss, delay, and throughput, obtained from network monitoring tools, to select the optimal node within the zone. Our proposal has been validated on the Kubernetes platform, an open-source orchestrator for the automatic management and deployment of microservices. The proposed scheduler improves the performance of distributed deep learning.
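The two ideas in the abstract (keep workers in one zone where possible, then rank nodes by loss, delay, and throughput) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `Node` fields, the equal-weight `score` formula, the normalization, and the fallback of filling the largest zones first are all assumptions made for illustration, and the Kubernetes integration is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    zone: str
    loss: float             # packet loss ratio in [0, 1] (assumed unit)
    delay_ms: float         # measured delay in milliseconds (assumed unit)
    throughput_mbps: float  # measured throughput in Mbit/s (assumed unit)


def score(node, max_delay_ms, max_throughput_mbps):
    """Hypothetical equal-weight score in [0, 1]; higher is better."""
    loss_term = 1.0 - node.loss
    delay_term = 1.0 - node.delay_ms / max_delay_ms if max_delay_ms > 0 else 1.0
    tput_term = (
        node.throughput_mbps / max_throughput_mbps if max_throughput_mbps > 0 else 0.0
    )
    return (loss_term + delay_term + tput_term) / 3.0


def place(nodes, workers_needed):
    """Pick nodes for a job, preferring a single zone, then the best scores."""
    if workers_needed < 1 or workers_needed > len(nodes):
        raise ValueError("workers_needed must be between 1 and len(nodes)")

    max_delay = max(n.delay_ms for n in nodes)
    max_tput = max(n.throughput_mbps for n in nodes)

    def node_score(n):
        return score(n, max_delay, max_tput)

    # Group nodes by zone, best-scoring node first within each zone.
    by_zone = defaultdict(list)
    for n in sorted(nodes, key=node_score, reverse=True):
        by_zone[n.zone].append(n)

    # Case 1: at least one zone can host the whole job. Choose the zone
    # whose top `workers_needed` nodes have the highest total score.
    fitting = [z for z, members in by_zone.items() if len(members) >= workers_needed]
    if fitting:
        best_zone = max(
            fitting,
            key=lambda z: sum(node_score(n) for n in by_zone[z][:workers_needed]),
        )
        return by_zone[best_zone][:workers_needed]

    # Case 2 (assumed fallback): no single zone is large enough, so fill
    # from the largest zones first to span as few zones as possible.
    chosen = []
    for members in sorted(by_zone.values(), key=len, reverse=True):
        chosen.extend(members[: workers_needed - len(chosen)])
        if len(chosen) == workers_needed:
            break
    return chosen


# Made-up measurements for two zones, for illustration only.
example_nodes = [
    Node("a1", "zone-a", loss=0.00, delay_ms=1.0, throughput_mbps=1000.0),
    Node("a2", "zone-a", loss=0.01, delay_ms=2.0, throughput_mbps=900.0),
    Node("b1", "zone-b", loss=0.00, delay_ms=0.5, throughput_mbps=1000.0),
    Node("b2", "zone-b", loss=0.20, delay_ms=50.0, throughput_mbps=100.0),
    Node("b3", "zone-b", loss=0.00, delay_ms=1.0, throughput_mbps=950.0),
]

if __name__ == "__main__":
    for node in place(example_nodes, workers_needed=2):
        print(node.name, node.zone)
```

In this sketch a two-worker job stays entirely inside one zone, while a four-worker job has to spill into a second zone because neither zone has four nodes. A real scheduler would likely weight the three metrics differently and refresh them from live monitoring data.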

Distributed Machine Learning Based Link Allocation Strategy

Network-Aware Distributed Machine Learning Over Wide Area Network

Graph Embedding Based Wireless Link Scheduling with Few Training Samples

Efficient Collaborative Learning over Unreliable D2D Network: Adaptive Cluster Head Selection and Resource Allocation

Distributed Learning of Predictive Structures from Multiple Tasks over Networks

Online Job Scheduling in Distributed Machine Learning Clusters

An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

Decentralized Edge Learning via Unreliable Device-to-Device Communications

Adaptive Cluster Head Selection and Spectrum Allocation for D2D-Enabled Collaborative Learning

Distributed Machine Learning with Strategic Network Design: A Game-Theoretic Perspective

A Distributed and Scalable Machine Learning Approach for Big Data

Optimize Resource Placement for In-Network Computing

A Hybrid Data and Model Transfer Framework for Distributed Machine Learning

A Survey From Distributed Machine Learning to Distributed Deep Learning

Available Bandwidth Based Dynamic Load Balancing over Multiple Links

D2D-Enabled Data Sharing for Distributed Machine Learning at Wireless Network Edge

Efficient Fully Distributed Federated Learning with Adaptive Local Links

Optimal Data Splitting in Distributed Optimization for Machine Learning

Learning-Based User Clustering and Link Allocation for Content Recommendation Based on D2D Multicast Communications

Distributed Machine Learning for Wireless Communication Networks: Techniques, Architectures, and Applications

Efficient Distributed Machine Learning with Trigger Driven Parallel Training