Aperiodic Local SGD: Beyond Local SGD.
Hao Zhang, Tingting Wu, Siyao Cheng, Jie Liu
DOI: https://doi.org/10.1145/3545008.3545013
2022-01-01
Abstract: Variants of stochastic gradient descent (SGD) are at the core of training deep neural network models. However, in distributed deep learning, where multiple computing devices and data segments are employed in the training process, the performance of SGD can be significantly limited by the overhead of gradient communication. Local SGD methods are designed to overcome this bottleneck by averaging the individual gradients trained over parallel workers after multiple local iterations. Currently, in both theoretical analyses and practical applications, most studies employ a periodic synchronization scheme by default, while few focus on aperiodic schemes that could yield better-performing models under limited computation and communication overhead. In this paper, we investigate local SGD with an arbitrary synchronization scheme to answer two questions: (1) Is the periodic synchronization scheme the best? (2) If not, what is the optimal one? First, for any synchronization scheme, we derive the performance bound under fixed overhead and formulate the performance optimization problem under given computation and communication constraints. We then find a succinct property of the optimal scheme: the local iteration number decreases as training continues, which indicates that the periodic scheme is suboptimal. Furthermore, with some reasonable approximations, we obtain an explicit form of the optimal scheme and propose Aperiodic Local SGD (ALSGD) as an improved substitute for local SGD without any additional overhead. Our experiments also confirm that, with the same computation and communication overhead, ALSGD outperforms local SGD, especially for heterogeneous data.
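To make the idea of an arbitrary synchronization scheme concrete, the following is a minimal sketch of local SGD driven by a list of local iteration counts, one per communication round. It is an illustration under stated assumptions, not the authors' implementation: the toy quadratic objective, the function name `local_sgd`, its parameters, and the particular decreasing schedule are all hypothetical and are not taken from the paper. In particular, the schedule shown is not the explicit optimal scheme derived in the paper; it only shares the qualitative property that the local iteration number decreases over training while the total computation and communication match the periodic baseline.

```python
import numpy as np


def local_sgd(schedule, num_workers=4, dim=5, lr=0.05, noise=0.1, seed=0):
    """Run local SGD on a toy problem (hypothetical setup, for illustration).

    schedule: list of local iteration counts, one entry per communication round.
    Returns the final synchronized model and the minimizer of the average objective.
    """
    rng = np.random.default_rng(seed)
    # Worker k minimizes 0.5 * ||w - targets[k]||^2, so the data is heterogeneous.
    targets = rng.normal(size=(num_workers, dim))
    w_global = np.zeros(dim)

    for local_steps in schedule:
        # Every worker starts the round from the shared model.
        local_models = np.tile(w_global, (num_workers, 1))
        for _ in range(local_steps):
            # Stochastic gradient: exact gradient plus Gaussian noise.
            grads = (local_models - targets) + noise * rng.normal(size=local_models.shape)
            local_models -= lr * grads
        # Synchronization: averaging the models equals averaging the accumulated
        # local updates, because all workers started the round at the same point.
        w_global = local_models.mean(axis=0)

    return w_global, targets.mean(axis=0)


# Two schemes with identical overhead: 48 local iterations and 6 synchronizations.
periodic = [8] * 6
aperiodic = [13, 11, 9, 7, 5, 3]  # illustrative decreasing schedule (assumption)

for name, schedule in [("periodic", periodic), ("aperiodic", aperiodic)]:
    w, optimum = local_sgd(schedule)
    print(name, "distance to optimum:", np.linalg.norm(w - optimum))
```

Both schedules use the same number of local iterations and the same number of communication rounds, which mirrors the fixed-overhead comparison described in the abstract. This toy objective is too simple to reproduce the reported gap between the two schemes; the sketch is meant only to show where the schedule enters the training loop.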