Adaptive Load Balancing for Parameter Servers in Distributed Machine Learning over Heterogeneous Networks

CAI Weibo, YANG Shulin, SUN Gang, ZHANG Qiming, YU Hongfang
DOI: https://doi.org/10.12142/ztecom.202301009
2023-01-01
Abstract: In distributed machine learning (DML) based on the parameter server (PS) architecture, an unbalanced distribution of communication load across PSs leads to a significant slowdown of model synchronization in heterogeneous networks because bandwidth is poorly utilized. To address this problem, a network-aware adaptive PS load distribution scheme is proposed, which accelerates model synchronization by proactively adjusting the communication load on PSs according to network states. We evaluate the proposed scheme on MXNet, a real-world distributed training platform, and the results show that our scheme achieves up to 2.68 times speed-up of model training in a dynamic and heterogeneous network environment.
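The abstract does not spell out the scheme itself, so the following is only an illustrative sketch of the general idea of network-aware load distribution, not the authors' algorithm. It assumes that each PS has a measured bandwidth and that synchronization time is governed by the slowest PS; the function names `assign_load` and `sync_time` and the example numbers are hypothetical.

```python
def assign_load(total_params, bandwidths):
    """Split total_params across PSs in proportion to measured bandwidth.

    Assumption for illustration: load proportional to bandwidth is one
    simple way to equalize per-PS transfer time.
    """
    total_bw = sum(bandwidths)
    if total_bw <= 0:
        raise ValueError("total bandwidth must be positive")
    # Floor each proportional share so the parts never exceed the total.
    shares = [int(total_params * bw / total_bw) for bw in bandwidths]
    # Give the leftover parameters to the fastest servers, one each.
    remainder = total_params - sum(shares)
    fastest_first = sorted(range(len(bandwidths)),
                           key=lambda i: bandwidths[i], reverse=True)
    for i in fastest_first[:remainder]:
        shares[i] += 1
    return shares


def sync_time(shares, bandwidths):
    """Synchronization finishes when the slowest PS finishes (assumed model)."""
    return max(s / bw for s, bw in zip(shares, bandwidths))


# Hypothetical example: one fast link and two slow links.
bandwidths = [10, 1, 1]
even = [334, 333, 333]                  # network-unaware even split
balanced = assign_load(1000, bandwidths)
print(sync_time(even, bandwidths), sync_time(balanced, bandwidths))
```

Under this simplified model, the even split is bottlenecked by the slow links, while the bandwidth-proportional split lets all PSs finish at roughly the same time. An adaptive scheme would presumably re-run such an assignment whenever the measured network state changes.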