Live Gradient Compensation for Evading Stragglers in Distributed Learning

Jian Xu,Shao-Lun Huang,Linqi Song,Tian Lan
DOI: https://doi.org/10.1109/infocom42981.2021.9488815
2021-01-01
Abstract:The training efficiency of distributed learning systems is vulnerable to stragglers, namely, those slow worker nodes. A naive strategy is performing the distributed learning by incorporating the fastest K workers and ignoring these stragglers, which may induce high deviation for non-IID data. To tackle this, we develop a Live Gradient Compensation (LGC) strategy to incorporate the one-step delayed gradients from stragglers, aiming to accelerate learning process and utilize the stragglers simultaneously. In LGC framework, mini-batch data are divided into smaller blocks and processed separately, which makes the gradient computed based on partial work accessible. In addition, we provide theoretical convergence analysis of our algorithm for non-convex optimization problem under non-IID training data to show that LGC-SGD has almost the same convergence error as full synchronous SGD. The theoretical results also allow us to quantify a novel tradeoff in minimizing training time and error by selecting the optimal straggler threshold. Finally, extensive simulation experiments of image classification on CIFAR-10 dataset are conducted, and the numerical results demonstrate the effectiveness of our proposed strategy.
What problem does this paper attempt to address?