Abstract: Distributed machine learning, including federated learning, has attracted considerable attention due to its potential to scale computational resources, reduce training time, and help protect user privacy. As one of the key enablers of distributed learning, asynchronous optimization allows multiple workers to process data simultaneously without paying the cost of synchronization delay. However, given limited communication bandwidth, asynchronous optimization can be hampered by gradient staleness, which severely hinders the learning process. In this paper, we present a communication-constrained distributed learning scheme in which asynchronous stochastic gradients generated by parallel workers are transmitted over a shared medium or link. Our aim is to minimize the average training time by striking the optimal tradeoff between the number of parallel workers and their gradient staleness. To this end, we formulate a queueing-theoretic model that allows us to find the optimal number of workers participating in the asynchronous optimization. Furthermore, we leverage the packet arrival time at the parameter server, also referred to as Timing Side Information (TSI), to compress the staleness information for staleness-aware Asynchronous Stochastic Gradient Descent (Asyn-SGD). Numerical results demonstrate a substantial reduction in training time owing to both the worker selection and the TSI-aided compression of staleness information.
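To make the notion of gradient staleness concrete, the following is a minimal sketch of a staleness-aware asynchronous SGD update on a toy quadratic objective. It is an illustration under stated assumptions, not the scheme proposed in the paper: the step-size rule (base learning rate divided by the staleness) is a commonly used heuristic chosen here for simplicity, workers report in a fixed round-robin order, and the names `staleness_aware_step` and `run_toy_async_sgd` are hypothetical. The queueing model and the TSI-aided compression are not modeled.

```python
def staleness_aware_step(params, stale_grad, staleness, base_lr=0.1):
    """Apply one server update, shrinking the step for older gradients.

    Assumption: the learning rate is divided by the staleness. This is a
    common heuristic and not necessarily the rule used in the paper.
    """
    tau = max(int(staleness), 1)  # a fresh gradient has staleness 1
    lr = base_lr / tau
    return [p - lr * g for p, g in zip(params, stale_grad)]


def run_toy_async_sgd(num_workers=4, num_updates=200):
    """Minimize f(w) = 0.5 * ||w||^2 using gradients from stale parameter copies."""
    w = [1.0, 1.0, 1.0]
    # Each worker holds a parameter snapshot and the server step at which it was read.
    snapshots = [(list(w), 0) for _ in range(num_workers)]
    for step in range(1, num_updates + 1):
        k = (step - 1) % num_workers   # assumption: gradients arrive round-robin
        w_old, t_read = snapshots[k]
        grad = list(w_old)             # gradient of 0.5 * ||w||^2 at the stale copy
        staleness = step - t_read      # server updates since this worker last read
        w = staleness_aware_step(w, grad, staleness)
        snapshots[k] = (list(w), step) # worker pulls the fresh parameters
    return w


if __name__ == "__main__":
    print(run_toy_async_sgd())
```

With round-robin arrivals, every gradient after the first round has a staleness equal to the number of workers, so adding workers increases parallelism but also shrinks each step. This is a simplified view of the tradeoff between the number of workers and gradient staleness described above.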

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient

Toward Communication Efficient Adaptive Gradient Method

Sparse Communication for Training Deep Networks

Adaptive Stochastic Gradient Descent for Fast and Communication-Efficient Distributed Learning

Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization

Sparse Gradient Compression for Distributed SGD

Gradient Sparsification for Communication-Efficient Distributed Optimization

Communication-Censored Distributed Stochastic Gradient Descent

Communication-Constrained Distributed Learning: TSI-Aided Asynchronous Optimization with Stale Gradient

Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods

Communication-Efficient and Byzantine-Robust Distributed Stochastic Learning with Arbitrary Number of Corrupted Workers

LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning

Accelerated Primal-Dual Algorithms for Distributed Smooth Convex Optimization over Networks

SparDL: Distributed Deep Learning Training with Efficient Sparse Communication

Lazily Aggregated Quantized Gradient Innovation for Communication-Efficient Federated Learning

A Communication-Efficient Stochastic Gradient Descent Algorithm for Distributed Nonconvex Optimization

STL-SGD: Speeding Up Local SGD with Stagewise Communication Period

Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load

CADA: Communication-Adaptive Distributed Adam

Adaptive Consensus Gradients Aggregation for Scaled Distributed Training

Can We Learn Communication-Efficient Optimizers?