EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

Zeyu Ji, Xingjun Zhang, Jingbo Li, Jia Wei, Zheng Wei
DOI: https://doi.org/10.1007/s11227-022-04466-8
IF: 3.3
2022-04-21
The Journal of Supercomputing
Abstract: Driven by big data, neural networks are becoming more complex, and the computing capacity of a single machine often cannot meet the demand. Distributed deep learning has shown a clear performance advantage in handling this problem. However, a serious issue in this field is the existence of stragglers, which significantly restrict the performance of the whole system. Fully exploiting the computing capacity of a system based on the parameter server architecture is an enormous challenge, especially in a heterogeneous environment. Motivated by this, we designed a method named EP4DDL to minimize the impact of the straggler problem through load balancing. From a statistical point of view, the approach introduces a novel metric named performance variance to give a comprehensive inspection of stragglers and employs flexible parallelism techniques for each node. We verify the algorithm on standard benchmarks and demonstrate that it can reduce training time to 57.46%, 24.8%, and 11.5%, respectively, without accuracy loss compared with FlexRR, Con-SGD, and Falcon.
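The abstract does not define the performance variance metric or the balancing rule, so the following is only a generic sketch of straggler-aware load balancing, not EP4DDL's algorithm. It assumes each worker reports recent per-iteration times, flags workers whose mean time exceeds a multiple of the cluster median, and splits a global batch in proportion to measured throughput. The names `find_stragglers`, `rebalance`, and the `slack` threshold are hypothetical.

```python
from statistics import mean, median


def find_stragglers(iter_times, slack=1.2):
    """Return workers whose mean iteration time exceeds slack * cluster median.

    iter_times maps a worker id to a list of recent iteration times (seconds).
    The slack factor is an illustrative assumption, not a value from the paper.
    """
    means = {w: mean(ts) for w, ts in iter_times.items()}
    cutoff = slack * median(means.values())
    return sorted(w for w, m in means.items() if m > cutoff)


def rebalance(iter_times, current_batch, global_batch):
    """Split global_batch across workers in proportion to measured throughput."""
    # Throughput = samples processed per second at the current batch size.
    throughput = {w: current_batch[w] / mean(ts) for w, ts in iter_times.items()}
    total = sum(throughput.values())
    shares = {w: int(global_batch * t / total) for w, t in throughput.items()}
    # Rounding down can leave a few samples unassigned; give them to the
    # fastest worker so the shares sum exactly to global_batch.
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += global_batch - sum(shares.values())
    return shares


if __name__ == "__main__":
    times = {"w0": [1.0, 1.0], "w1": [1.0, 1.0], "w2": [2.0, 2.0]}
    batches = {"w0": 64, "w1": 64, "w2": 64}
    print(find_stragglers(times))
    print(rebalance(times, batches, global_batch=192))
```

In this toy setup the slow worker receives a smaller share of the batch, so all workers would finish an iteration at roughly the same time. A real system would also need to smooth the measurements and bound how fast the shares change.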