Abstract: Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, stragglers and limited bandwidth may induce random computational and communication delays, severely hindering the learning process. How to accelerate asynchronous SGD by efficiently scheduling multiple workers is therefore an important issue. In this paper, a unified framework is presented to analyze and optimize the convergence of asynchronous SGD based on stochastic delay differential equations (SDDEs) and the Poisson approximation of aggregated gradient arrivals. In particular, we characterize the run time and staleness of distributed SGD without a memorylessness assumption on the computation times. Given the learning rate, we express the relevant SDDE's damping coefficient and its delay statistics as functions of the number of activated clients, the staleness threshold, the eigenvalues of the Hessian matrix of the objective function, and the overall computational/communication delay. The formulated SDDE yields both the convergence condition and the convergence speed of distributed SGD through its characteristic roots, which in turn allows the scheduling policies for asynchronous/event-triggered SGD to be optimized. Interestingly, increasing the number of activated workers does not necessarily accelerate distributed SGD, because of staleness. Moreover, a small degree of staleness does not necessarily slow down convergence, while a large degree of staleness results in the divergence of distributed SGD. Numerical results demonstrate the potential of our SDDE framework, even in complex learning tasks with non-convex objective functions.
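
The effect of staleness described in the abstract can be illustrated with a minimal sketch. This is not the paper's SDDE framework: it is a deterministic, discrete-time toy model in which gradient descent on the scalar quadratic f(x) = curvature * x^2 / 2 uses a gradient evaluated at an iterate that is a fixed number of steps old. The function names (`delayed_gd`, `tail_amplitude`), the fixed delay, and the parameter values are all assumptions chosen for illustration.

```python
def delayed_gd(curvature, lr, delay, steps, x0=1.0):
    """Iterate x[k+1] = x[k] - lr * curvature * x[k - delay].

    Toy model (an assumption, not the paper's method): gradient descent on
    f(x) = curvature * x**2 / 2 where every gradient is `delay` steps stale.
    Returns the full trajectory, including the initial history.
    """
    # Constant history: the iterate equals x0 until the first update is applied.
    xs = [x0] * (delay + 1)
    for _ in range(steps):
        stale_x = xs[-1 - delay]  # iterate at which the stale gradient was computed
        xs.append(xs[-1] - lr * curvature * stale_x)
    return xs


def tail_amplitude(xs, window=40):
    """Largest |x| over the last `window` iterates (robust to oscillation)."""
    return max(abs(x) for x in xs[-window:])


# Same learning rate and curvature, three hypothetical staleness levels.
results = {d: tail_amplitude(delayed_gd(1.0, 0.3, d, 200)) for d in (0, 1, 10)}

for d, amp in results.items():
    print(f"delay={d:2d}  tail amplitude={amp:.3e}")
```

With these illustrative values, a delay of one step ends with a smaller residual than no delay at all, whereas a delay of ten steps makes the iterates grow instead of shrink. This mirrors, in a much simpler setting, the abstract's claims that mild staleness need not slow convergence and that large staleness causes divergence.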

Distributed Stochastic Optimization with Random Communication and Computational Delays: Optimal Policies and Performance Analysis

Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework

Communication-Constrained Distributed Learning: TSI-Aided Asynchronous Optimization with Stale Gradient

Delayed Stochastic Algorithms for Distributed Weakly Convex Optimization

Toward Understanding the Impact of Staleness in Distributed Machine Learning

Convergence Analysis of Asynchronous Stochastic Recursive Gradient Algorithms

Accelerated Distributed Stochastic Non-Convex Optimization over Time-Varying Directed Networks

Adaptive Stochastic Gradient Descent for Fast and Communication-Efficient Distributed Learning

Efficient Byzantine-Resilient Stochastic Gradient Desce

A Communication-Efficient Stochastic Gradient Descent Algorithm for Distributed Nonconvex Optimization

Decentralized Optimization in Networks with Arbitrary Delays

Convergence in High Probability of Distributed Stochastic Gradient Descent Algorithms

On the Communication Complexity of Decentralized Bilevel Optimization

A Continuous-Time Analysis of Distributed Stochastic Gradient

STL-SGD: Speeding Up Local SGD with Stagewise Communication Period

Asynchronous Decentralized Accelerated Stochastic Gradient Descent

A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient

Scaling up stochastic gradient descent for non-convex optimisation

Stochastic Optimization with Decision-Dependent Distributions

Asynchronous Distributed Optimization with Delay-free Parameters