Abstract: The demand for large-scale deep learning is increasing, and distributed training is the current mainstream solution. Ring AllReduce is widely used as a decentralized data-parallel algorithm. However, in a heterogeneous environment each worker processes the same amount of data, so faster workers lose a lot of time waiting for slower ones. The algorithm therefore adapts poorly to heterogeneous clusters, and resources are underused. In this paper, we design and implement a static allocation algorithm. The dataset is manually allocated to each worker, and samples are drawn proportionally for training, which speeds up training of the network in a heterogeneous environment. We verify the convergence of the network model under this algorithm, and its influence on training speed, on a single machine with multiple cards and on multiple machines with multiple cards. Having established feasibility, we propose a self-adaptive allocation algorithm that lets each machine find the data share that suits the current environment. The self-adaptive allocation algorithm can reduce training time by nearly one-third to one-half compared with allocating data in equal proportions. To better show the applicability of the algorithm in heterogeneous clusters, we replace a poorly performing worker with a well-performing worker, or add a poorly performing worker to the heterogeneous cluster. Experimental results show that training time decreases as overall performance improves, which indicates that resources are fully used. Further, this algorithm is suitable not only for straggler problems but also for most heterogeneous situations. It can be used as a plug-in for AllReduce and its variant algorithms.
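
The abstract does not give the allocation rule itself, so the following is only a minimal sketch of proportional data allocation under stated assumptions. The function names `allocate_samples` and `adapt_speeds`, the use of a per-worker speed estimate, and the largest-remainder rounding are all hypothetical choices made for illustration, not the paper's implementation.

```python
def allocate_samples(total_samples, speeds):
    """Split total_samples across workers in proportion to their speeds.

    `speeds` is an assumed per-worker throughput estimate (samples/second).
    Rounding uses the largest-remainder method so shares sum exactly
    to total_samples.
    """
    if total_samples < 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("need a non-negative total and positive speeds")
    total_speed = sum(speeds)
    exact = [total_samples * s / total_speed for s in speeds]
    shares = [int(x) for x in exact]  # floor of each exact share
    leftover = total_samples - sum(shares)
    # Hand the remaining samples to the workers with the largest fractions.
    order = sorted(range(len(speeds)),
                   key=lambda i: exact[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def adapt_speeds(shares, step_times):
    """Re-estimate each worker's speed from its last measured step time.

    Hypothetical feedback step: a worker that handled n samples in t
    seconds is assumed to run at n / t samples per second.
    """
    return [n / t for n, t in zip(shares, step_times)]


if __name__ == "__main__":
    # Start from an equal split, then rebalance from measured times.
    shares = allocate_samples(1000, [1.0, 1.0])
    speeds = adapt_speeds(shares, step_times=[1.0, 4.0])
    print(allocate_samples(1000, speeds))
```

In this sketch a static allocation corresponds to calling `allocate_samples` once with fixed speeds, while a self-adaptive variant would repeat the measure-and-reallocate step during training. How often to re-measure, and how to smooth noisy timings, are left open here.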

Training Job Placement in Clusters with Statistical In-Network Aggregation

Enabling Switch Memory Management for Distributed Training with In-Network Aggregation

In-Network Aggregation with Transport Transparency for Distributed Training

Preemptive Switch Memory Usage to Accelerate Training Jobs with Shared In-Network Aggregation

Identifying Performance Bottleneck in Shared In-Network Aggregation During Distributed Training

AggTree: A Routing Tree with In-Network Aggregation for Distributed Training

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Efficient Data-Plane Memory Scheduling for In-Network Aggregation

Optimize Resource Placement for In-Network Computing

ATP: In-network Aggregation for Multi-tenant Learning

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

TIAS: Two-level Information-Agnostic Job Scheduling in GPU Clusters

Task allocation for decentralized training in heterogeneous environment

Optimization of Topology-Aware Job Allocation on a High-Performance Computing Cluster by Neural Simulated Annealing

Rina: Enhancing Ring-AllReduce with In-network Aggregation in Distributed Model Training

Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

N4: Network for N Neural Network Training

GPU Cluster Scheduling for Network-Sensitive Deep Learning

Energy-Efficient GPU Clusters Scheduling for Deep Learning

AINNS: All-Inclusive Neural Network Scheduling via Accelerator Formalization

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training