Accelerating Distributed Training in Heterogeneous Clusters via a Straggler-Aware Parameter Server

Huihuang Yu, Zongwei Zhu, Xianglan Chen, Yuming Cheng, Yahui Hu, Xi Li
DOI: https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00042
2019-01-01
Abstract: Unlike training in homogeneous clusters, distributed training in heterogeneous clusters suffers severe performance degradation because of stragglers. Instead of the synchronous stochastic optimization commonly used in homogeneous clusters, we adopt an asynchronous approach, which does not wait for stragglers but has the problem of using stale parameters. To solve this problem, we design a straggler-aware parameter server (SaPS), which detects stragglers through the version of the parameters and mitigates their effect with a coordinator that limits parameter staleness without waiting for stragglers. Experimental results show that SaPS converges faster than fully synchronous training, fully asynchronous training, and some SGD variants.
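To make the version-based idea concrete, the following is a minimal single-process sketch, not the SaPS implementation. The abstract only says that stragglers are detected through parameter versions and that a coordinator limits staleness. Everything else here is an assumption for illustration: the class name ToyParameterServer, the max_staleness threshold, and the policy of dropping overly stale gradients and down-weighting mildly stale ones.

```python
from dataclasses import dataclass, field


@dataclass
class ToyParameterServer:
    """Toy parameter server with version tracking (illustrative only)."""
    params: list
    lr: float = 0.1
    max_staleness: int = 2          # hypothetical staleness bound
    version: int = 0                # incremented on every applied update
    stragglers: set = field(default_factory=set)

    def pull(self):
        # A worker gets a copy of the parameters and the version it read.
        return list(self.params), self.version

    def push(self, worker_id, grad, pulled_version):
        # Staleness = number of updates applied since this worker pulled.
        staleness = self.version - pulled_version
        if staleness > self.max_staleness:
            # Assumed policy: flag the worker and drop the gradient.
            self.stragglers.add(worker_id)
            return False
        # Assumed policy: down-weight gradients by their staleness.
        scale = self.lr / (1 + staleness)
        self.params = [p - scale * g for p, g in zip(self.params, grad)]
        self.version += 1
        return True


ps = ToyParameterServer(params=[1.0, 1.0], max_staleness=1)
_, slow_version = ps.pull()           # slow worker pulls at version 0
for _ in range(3):                    # fast worker finishes three updates
    _, v = ps.pull()
    ps.push("fast", [0.5, 0.5], v)
accepted = ps.push("slow", [0.5, 0.5], slow_version)
```

In this run the slow worker's gradient is three versions behind, which exceeds the assumed bound of 1, so the push is rejected and the worker is recorded as a straggler. No worker ever blocks waiting for another, which is the property the abstract emphasizes.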