Abstract: Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, stragglers and limited bandwidth may induce random computational/communication delays, thereby severely hindering the learning process. How to accelerate asynchronous SGD by efficiently scheduling multiple workers is therefore an important issue. In this paper, a unified framework is presented to analyze and optimize the convergence of asynchronous SGD based on stochastic delay differential equations (SDDEs) and the Poisson approximation of aggregated gradient arrivals. In particular, we present the run time and staleness of distributed SGD without a memorylessness assumption on the computation times. Given the learning rate, we reveal the relevant SDDE's damping coefficient and its delay statistics as functions of the number of activated clients, the staleness threshold, the eigenvalues of the Hessian matrix of the objective function, and the overall computational/communication delay. The formulated SDDE allows us to obtain both the convergence condition and the convergence speed of distributed SGD by calculating its characteristic roots, and thereby to optimize the scheduling policies for asynchronous/event-triggered SGD. Interestingly, we show that increasing the number of activated workers does not necessarily accelerate distributed SGD, due to staleness. Moreover, a small degree of staleness does not necessarily slow down the convergence, while a large degree of staleness will result in the divergence of distributed SGD. Numerical results demonstrate the potential of our SDDE framework, even in complex learning tasks with non-convex objective functions.
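The staleness claims in the abstract can be illustrated with a toy model that is not the paper's SDDE framework: a deterministic, scalar delayed gradient iteration on a quadratic objective, x[k+1] = x[k] - step * x[k - delay]. Here `step` stands for the learning rate multiplied by one Hessian eigenvalue, and `delay` is a fixed staleness in iterations. The function names, the fixed (non-random) delay, and the constant initial history are all assumptions made for this sketch.

```python
def delayed_gd(step, delay, iters=2000, x0=1.0):
    """Iterate x[k+1] = x[k] - step * x[k - delay] (toy model, assumed here).

    step  : learning rate times a Hessian eigenvalue (hypothetical scalar)
    delay : staleness of the gradient, in iterations (0 = fresh gradient)
    """
    xs = [x0] * (delay + 1)          # constant history before the first update
    for _ in range(iters):
        stale = xs[-1 - delay]       # iterate at which the gradient was computed
        xs.append(xs[-1] - step * stale)
    return xs


def tail_amplitude(xs, window=100):
    """Largest |x| over the last `window` iterates, robust to oscillation."""
    return max(abs(x) for x in xs[-window:])


if __name__ == "__main__":
    for delay in (0, 1, 20):
        amp = tail_amplitude(delayed_gd(step=0.1, delay=delay))
        print(f"delay={delay:2d}  tail amplitude={amp:.3e}")
```

With `step=0.1`, the runs with delay 0 and delay 1 shrink toward zero, and in this toy setting the delay-1 run ends up no slower than the fresh-gradient run, while delay 20 produces growing oscillations. This mirrors the qualitative statement above only; the paper's analysis concerns random delays and SDDE characteristic roots, which this sketch does not model.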

DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration

Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework

ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm with Adaptive Batch Size for Heterogeneous GPU Clusters

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

Dual-Delayed Asynchronous SGD for Arbitrarily Heterogeneous Data

Guided parallelized stochastic gradient descent for delay compensation

A(DP)$^2$SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy

On the Convergence of Quantized Parallel Restarted SGD for Central Server Free Distributed Training

Asynchronous Stochastic Gradient Descent with Delay Compensation for Distributed Deep Learning

SAP-SGD: Accelerating Distributed Parallel Training with High Communication Efficiency on Heterogeneous Clusters

Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

Weighted Aggregating Stochastic Gradient Descent for Parallel Deep Learning

DC-S3GD: Delay-Compensated Stale-Synchronous SGD for Large-Scale Decentralized Neural Network Training

Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

Scaling up stochastic gradient descent for non-convex optimisation

DisSAGD: A Distributed Parameter Update Scheme Based on Variance Reduction

Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources

S2 Reducer: High-Performance Sparse Communication to Accelerate Distributed Deep Learning