Abstract: In machine learning, asynchronous parallel stochastic gradient descent (APSGD) is widely used to speed up training by using multiple workers. However, the delay of stale gradients in asynchronous algorithms is generally proportional to the total number of workers, and using delayed gradients introduces additional deviation from the accurate gradient. This may have a negative influence on the convergence of the algorithm. One may ask: how many workers can we use at most while still achieving good convergence and linear speedup? In this paper, we consider the second-order convergence of asynchronous algorithms in non-convex optimization. We investigate the behavior of APSGD with consistent read near strict saddle points and provide a theoretical guarantee that if the total number of workers is bounded by O(K^{1/3} M^{-1/3}) (where K is the total number of steps and M is the mini-batch size), APSGD will converge to good stationary points (||∇f(x)|| ≤ ϵ, ∇^2 f(x) ≽ -√(ϵ) I, ϵ^2 ≤ O(√(1/(MK)))) and linear speedup is achieved. Our work gives the first theoretical guarantee on second-order convergence for asynchronous algorithms. The technique we provide can be generalized to analyze other types of asynchronous algorithms and to understand their behavior in distributed asynchronous parallel training.

On the Convergence of Perturbed Distributed Asynchronous Stochastic Gradient Descent to Second Order Stationary Points in Non-convex Optimization.
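
The stale-gradient setting described in the abstract can be illustrated with a small single-process simulation. This is only a toy sketch under stated assumptions, not the algorithm or analysis of the paper: the objective f(x, y) = x^2/2 - y^2/2 + y^4/4 (a strict saddle at the origin, minima at y = ±1), the fixed delay of `num_workers - 1` steps, the Gaussian gradient noise, and all function names and default parameters are hypothetical choices made for illustration. The helper `max_workers_scale` evaluates only the scaling K^{1/3} M^{-1/3} from the abstract, with the unknown constant inside the O(·) ignored.

```python
import random
from collections import deque


def grad(x, y):
    """Gradient of the toy objective f(x, y) = x^2/2 - y^2/2 + y^4/4."""
    return x, -y + y ** 3


def min_hessian_eig(x, y):
    """Smallest Hessian eigenvalue; the Hessian is diag(1, -1 + 3*y^2)."""
    return min(1.0, -1.0 + 3.0 * y ** 2)


def max_workers_scale(total_steps, batch_size):
    """Scaling K^{1/3} M^{-1/3} from the abstract, constants ignored."""
    return (total_steps / batch_size) ** (1.0 / 3.0)


def simulate_stale_sgd(num_workers, steps=2000, lr=0.05, noise_std=0.1, seed=0):
    """Run SGD whose gradient is evaluated at an iterate from tau steps ago.

    Assumption: the delay is a fixed tau = num_workers - 1, a crude stand-in
    for staleness that grows with the number of workers.
    """
    rng = random.Random(seed)
    tau = num_workers - 1
    point = (0.0, 0.0)  # start exactly at the strict saddle point
    # history[0] is the iterate from tau steps ago once the buffer is full
    history = deque([point], maxlen=tau + 1)
    for _ in range(steps):
        stale_x, stale_y = history[0]
        gx, gy = grad(stale_x, stale_y)
        # stochastic gradient: stale gradient plus zero-mean noise
        gx += rng.gauss(0.0, noise_std)
        gy += rng.gauss(0.0, noise_std)
        point = (point[0] - lr * gx, point[1] - lr * gy)
        history.append(point)
    return point


if __name__ == "__main__":
    x, y = simulate_stale_sgd(num_workers=4)
    print("final iterate:", (x, y))
    print("smallest Hessian eigenvalue:", min_hessian_eig(x, y))
    print("worker scale for K=8000, M=1:", max_workers_scale(8000, 1))
```

In this toy run the gradient noise moves the iterate off the saddle and the smallest Hessian eigenvalue at the final point can be inspected directly. Changing `num_workers` changes the staleness, but the sketch says nothing about where the bound from the abstract actually lies.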