Abstract: To cope with their huge computational cost, large-scale deep neural networks (DNNs) are usually trained on distributed systems, most commonly the widely used parameter server architecture, which consists of a parameter server and multiple local workers equipped with powerful GPU cards. During training, local workers frequently pull the global model from the parameter server and push their computed gradients to it. Because bandwidth is limited, this frequent communication becomes a severe bottleneck for training acceleration. Recent attempts to address the problem use quantization methods that compress the gradients for efficient communication. However, such methods overlook the effect of compression on model performance, so they suffer from either a low compression ratio or an accuracy drop. In this paper, to better address this problem, we study distributed deep learning as a multi-agent system (MAS) problem. Specifically, 1) the local workers and the parameter server are separate agents in the system; 2) the objective of these agents is to maximize the efficacy of the learned model through their cooperative interactions; 3) the strategy of an agent describes how it takes actions, i.e., how it communicates its computed gradients or the global model; 4) rational agents always select the best-response strategy, the one with the optimal utility. Inspired by this view, we design a MAS approach for distributed training of DNNs. In our method, the agents first estimate the utility (i.e., the benefit to improving the model) of each action (i.e., transferring a subset of the gradients or the global model), and then take the best-response strategy based on their estimated utilities, mixed with ε-random exploration. We call our new method Slim-DP because, unlike standard data parallelism, it communicates only a subset of the gradients or the global model. Our experimental results demonstrate that Slim-DP reduces communication cost further and achieves a better speedup, without loss of accuracy, than standard data parallelism and its quantized version.
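
The selection step described in the abstract (estimate a utility per action, take the best response, mix in ε-random exploration) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `select_subset` and `sparsify` are hypothetical, and using the absolute gradient value as the utility estimate is an assumption made only for illustration, since the abstract does not specify how utility is computed.

```python
import random


def select_subset(utilities, k, epsilon, rng):
    """Pick k indices: most by highest utility, an epsilon share at random.

    Hypothetical helper; the split between greedy and random picks is an
    assumption, not a detail taken from the abstract.
    """
    n = len(utilities)
    k = min(k, n)
    n_explore = int(round(epsilon * k))
    n_greedy = k - n_explore
    # Best-response part: indices with the largest estimated utility.
    ranked = sorted(range(n), key=lambda i: utilities[i], reverse=True)
    greedy = ranked[:n_greedy]
    # Exploration part: uniform sample from the indices not already chosen.
    explore = rng.sample(ranked[n_greedy:], n_explore)
    return sorted(greedy + explore)


def sparsify(gradient, k, epsilon, rng):
    """Return {index: value} for the subset of the gradient to communicate."""
    # Assumption: |g_i| stands in for the estimated utility of sending g_i.
    utilities = [abs(g) for g in gradient]
    chosen = select_subset(utilities, k, epsilon, rng)
    return {i: gradient[i] for i in chosen}


gradient = [0.1, -5.0, 0.3, 4.0, -0.2, 0.05, 2.0, -0.01]
message = sparsify(gradient, k=4, epsilon=0.25, rng=random.Random(0))
greedy_only = sparsify(gradient, k=3, epsilon=0.0, rng=random.Random(0))
```

With `k=4` and `epsilon=0.25`, three entries are chosen greedily and one is drawn at random from the remainder, so a worker would send 4 of the 8 gradient values together with their indices rather than the full vector.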

DPS: A DSM-based Parameter Server for Machine Learning

DRPS: Efficient Disk-Resident Parameter Servers for Distributed Machine Learning

Accelerating Distributed Machine Learning by Smart Parameter Server

KunPeng: Parameter Server Based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

FluentPS: A Parameter Server Design with Low-frequency Synchronization for Distributed Deep Learning

PS2: Parameter Server on Spark

Adaptive Load Balancing for Parameter Servers in Distributed Machine Learning over Heterogeneous Networks

HotML: A DSM-based Machine Learning System for Social Networks

Scalable Learning and Probabilistic Analytics of Industrial Big Data Based on Parameter Server: Framework, Methods and Applications

P/D-Serve: Serving Disaggregated Large Language Model at Scale

H-PS: A Heterogeneous-Aware Parameter Server With Distributed Neural Network Training

Petrel: Community-Aware Synchronous Parallel For Heterogeneous Parameter Server

Elastic Model Aggregation with Parameter Service

JointPS: Joint Parameter Server Placement and Flow Scheduling for Machine Learning Clusters

A Parameter Communication Optimization Strategy for Distributed Machine Learning in Sensors

Slim-DP: A Multi-Agent System for Communication-Efficient Distributed Deep Learning

PSscheduler: A Parameter Synchronization Scheduling Algorithm for Distributed Machine Learning in Reconfigurable Optical Networks

Efficient Communication Scheduling for Parameter Synchronization of DML in Data Center Networks

HiPS - Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning

DyPS: Dynamic Parameter Sharing in Multi-Agent Reinforcement Learning for Spatio-Temporal Resource Allocation