DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

Qinggang Zhou, Yawen Zhang, Pengcheng Li, Xiaoyong Liu, Jun Yang, Runsheng Wang, Ru Huang
2020-01-01
Abstract: State-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations, to wait for gradients aggregated from all workers, and to receive weight updates before the next batch of tasks. This synchronous execution model exposes the overheads of gradient/weight communication among a large number of workers in a distributed training system. We propose a new SGD algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD and forward/back propagations to hide 100% of the communication overhead. By adjusting the gradient update scheme, this algorithm uses hardware resources more efficiently and reduces the reliance on low-latency, high-throughput interconnects. The theoretical analysis and the experimental results show that its convergence rate is O(1/sqrt(K)), the same as SGD. The performance evaluation demonstrates that it enables linear performance scale-up with the cluster size.
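
The idea of overlapping weight averaging with continued local computation can be illustrated with a toy simulation. The sketch below is not the authors' implementation; it is one plausible reading of "local SGD with delayed averaging" under stated assumptions. The function name `delayed_averaging_sgd`, the parameters `tau` (local steps between averaging rounds) and `delay` (steps an averaging round takes to complete), the quadratic objective, and the rule for merging a late average with the local progress made in the meantime are all hypothetical choices made for illustration.

```python
import numpy as np

def delayed_averaging_sgd(num_workers=4, steps=200, tau=4, delay=2,
                          lr=0.05, seed=0):
    """Toy simulation of local SGD where averaged weights arrive late.

    Assumptions (not taken from the paper): each worker minimizes
    0.5 * ||w - target||^2 with noisy gradients, an averaging round starts
    every `tau` steps, and its result becomes available `delay` steps later
    (with delay < tau so at most one round is in flight).
    """
    rng = np.random.default_rng(seed)
    target = np.array([1.0, -2.0])
    weights = rng.normal(size=(num_workers, 2))  # one row per worker
    pending = None  # (step at which the average arrives, snapshot, average)

    for k in range(1, steps + 1):
        # Local SGD step on every worker; workers never stall here.
        noise = 0.1 * rng.normal(size=weights.shape)
        grads = (weights - target) + noise
        weights = weights - lr * grads

        # Start an averaging round: snapshot the weights being communicated.
        if k % tau == 0 and pending is None:
            snapshot = weights.copy()
            pending = (k + delay, snapshot, snapshot.mean(axis=0))

        # The average arrives: keep the local progress made during the delay
        # and re-base it on the averaged weights.
        if pending is not None and k == pending[0]:
            _, snapshot, average = pending
            weights = average + (weights - snapshot)
            pending = None

    return weights, target
```

Setting `delay=0` in this sketch reduces it to ordinary local SGD with periodic synchronous averaging, which makes it easy to compare the two behaviours on the same toy problem.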