Abstract:To afford the huge computational cost, large-scale deep neural networks (DNN) are usually trained on the distributed system, especially the widely-used parameter server architecture, consisting of a parameter server as well as multiple local workers with powerful GPU cards. During the training, local workers frequently pull the global model and push their computed gradients from/to the parameter server. Due to the limited bandwidth, such frequent communication will cause severe bottleneck for the training acceleration. As recent attempts to address this problem, quantization methods have been proposed to compress the gradients for efficient communication. However, such methods overlook the effects of compression on the model performance such that they either suffer from a low compression ratio or an accuracy drop. In this paper, to better address this problem, we investigate the distributed deep learning as a multi-agent system (MAS) problem. Specifically, 1) local workers and the parameter server are separate agents in the system; 2) the objective of these agents is to maximize the efficacy of the learned model through their cooperative interactions; 3) the strategy of the agents describes how they take actions, i.e. communicate their computed gradients or the global model; 4) rational agents always select the best-response strategy with the optimal utility. Inspired by this, we design a MAS approach for distributed training of DNN. In our method, the agents first estimate the utility (i.e., the benefit to help improve the model) of each action (i.e., transferring a subset of the gradients or the global model), and then take the best-response strategy based on their estimated utilities mixed with e-random exploration. We call our new method Slim-DP as it, being different from the standard data-parallelism, only communicates a subset of the gradient or the global model. Our experimental results demonstrate that our proposed Slim-DP can reduce more communication cost and achieve better speedup without loss of accuracy than the standard data parallelism and its quantization version.

Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs over Heterogeneous Infrastructure

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Coded Parallelism for Distributed Deep Learning.

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training.

HeterPS: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments

EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

Distributed Training Optimization for DCU

DynaHB: A Communication-Avoiding Asynchronous Distributed Framework with Hybrid Batches for Dynamic GNN Training

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Joint DNN Partition and Resource Allocation Optimization for Energy-Constrained Hierarchical Edge-Cloud Systems

A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning

Distributed Newton Methods for Deep Neural Networks

Online Job Scheduling in Distributed Machine Learning Clusters

Slim-DP: A Multi-Agent System for Communication-Efficient Distributed Deep Learning