Abstract: To afford the huge computational cost, large-scale deep neural networks (DNNs) are usually trained on distributed systems, most commonly the widely used parameter server architecture, which consists of a parameter server and multiple local workers equipped with powerful GPU cards. During training, local workers frequently pull the global model from the parameter server and push their computed gradients to it. Because bandwidth is limited, such frequent communication becomes a severe bottleneck for training acceleration. To address this problem, quantization methods have recently been proposed that compress the gradients for efficient communication. However, these methods overlook the effect of compression on model performance, so they suffer from either a low compression ratio or an accuracy drop. In this paper, to better address this problem, we treat distributed deep learning as a multi-agent system (MAS) problem. Specifically, 1) the local workers and the parameter server are separate agents in the system; 2) the objective of these agents is to maximize the efficacy of the learned model through their cooperative interactions; 3) the strategy of an agent describes how it takes actions, i.e., communicates its computed gradients or the global model; 4) rational agents always select the best-response strategy, the one with the optimal utility. Inspired by this, we design a MAS approach for distributed training of DNNs. In our method, the agents first estimate the utility (i.e., the benefit for improving the model) of each action (i.e., transferring a subset of the gradients or the global model), and then take the best-response strategy based on their estimated utilities, mixed with ε-random exploration. We call our new method Slim-DP because, unlike standard data parallelism, it communicates only a subset of the gradients or the global model. Our experimental results demonstrate that Slim-DP reduces communication cost more and achieves better speedup than standard data parallelism and its quantized version, without loss of accuracy.
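The selection step described in the abstract (estimate a utility per action, take the best response, and mix in random exploration) can be illustrated with a minimal sketch. The abstract does not say how utility is estimated, so this sketch assumes the gradient magnitude as a stand-in for utility. The names `select_indices`, `sparsify`, `budget`, and `epsilon` are hypothetical and do not come from the paper.

```python
import random


def select_indices(gradient, budget, epsilon, rng=None):
    """Pick `budget` coordinates of `gradient` to communicate.

    ASSUMPTION: utility of a coordinate is approximated by |gradient[i]|;
    the paper's actual utility estimate is not given in the abstract.
    A fraction `epsilon` of the budget is spent on random exploration.
    """
    rng = rng or random.Random()
    budget = min(budget, len(gradient))
    n_explore = int(round(epsilon * budget))
    n_exploit = budget - n_explore

    # Best response: coordinates with the highest estimated utility.
    ranked = sorted(range(len(gradient)),
                    key=lambda i: abs(gradient[i]), reverse=True)
    exploit = ranked[:n_exploit]

    # Exploration: uniform sample from the coordinates not yet chosen.
    explore = rng.sample(ranked[n_exploit:], n_explore)
    return sorted(exploit + explore)


def sparsify(gradient, indices):
    """Return the (index, value) pairs a worker would push."""
    return [(i, gradient[i]) for i in indices]


if __name__ == "__main__":
    grad = [0.1, -5.0, 0.3, 2.0, -0.2, 0.05]
    chosen = select_indices(grad, budget=4, epsilon=0.5,
                            rng=random.Random(0))
    print(sparsify(grad, chosen))
```

With `epsilon=0` the selection reduces to plain top-k by magnitude. A larger `epsilon` sends more randomly chosen coordinates, which gives low-utility parameters a chance to be communicated occasionally.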

Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

Priority-based Parameter Propagation for Distributed DNN Training

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Coded Parallelism for Distributed Deep Learning

OSP: Boosting Distributed Model Training with 2-Stage Synchronization

HiPS - Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning

Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation

A Parameter Communication Optimization Strategy for Distributed Machine Learning in Sensors

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

Distributed Learning of Predictive Structures from Multiple Tasks over Networks

TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs

Impact of Synchronization Topology on DML Performance: Both Logical Topology and Physical Topology

Accelerating Geo-distributed Machine Learning with Network-Aware Adaptive Tree and Auxiliary Route

Rina: Enhancing Ring-AllReduce with In-network Aggregation in Distributed Model Training

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Accelerating neural network training with distributed asynchronous and selective optimization (DASO)

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

Slim-DP: A Multi-Agent System for Communication-Efficient Distributed Deep Learning

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models