DSANA: A Distributed Machine Learning Acceleration Solution Based on Dynamic Scheduling and Network Acceleration

Runhua Zhang, Guowei Shen, Liangyi Gong, Chun Guo
DOI: https://doi.org/10.1109/hpcc-smartcity-dss50907.2020.00037
2020-01-01
Abstract: Distributed machine learning (DML) has become a feasible solution for handling ever-growing training data and models. Among existing DML architectures, the parameter server (PS) architecture stands out for iterative convergent algorithms and is widely deployed in practice, thanks to advantages such as flexible expansion. Under this architecture, parameter synchronization based on Bulk Synchronous Parallel (BSP) has become a research hotspot. In BSP mode, the efficiency of each iteration is determined by the slowest node in the cluster, so the straggler problem becomes the main cause of reduced DML training efficiency, and it is even more prominent in heterogeneous cloud services. Existing work focuses mainly on the straggler problem and usually ignores the importance of communication, yet inefficient communication is also one of the causes of inefficient DML iterations. In this paper, we propose DSANA, which first alleviates certain straggler problems by dynamically scheduling computation tasks. Second, DSANA improves the overlap of computation and communication by dividing larger transmission parameters, further improving the iteration efficiency of DML training. We conduct comparison experiments with the classic iterative algorithm PageRank on four data sets of different scales in two cloud service scenarios. The experimental results show that DSANA improves training efficiency by 36.6%$\sim$56.4% compared with the baseline solution.
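The two effects named in the abstract can be illustrated with a toy cost model. The sketch below is not DSANA's actual scheduler or transport. The worker speeds, the speed-proportional assignment policy, and the two-stage pipeline formula are all assumptions chosen only to show why a straggler dominates a BSP iteration and why splitting a large transfer into chunks can increase computation/communication overlap.

```python
def bsp_iteration_time(work, speed):
    """Under BSP, an iteration ends only when the slowest worker finishes."""
    return max(w / s for w, s in zip(work, speed))


def proportional_split(total_work, speed):
    """Hypothetical policy: assign work in proportion to each worker's speed."""
    total_speed = sum(speed)
    return [total_work * s / total_speed for s in speed]


def pipelined_time(compute, comm, chunks):
    """Idealized two-stage pipeline: compute and send `chunks` equal pieces.

    With one chunk nothing overlaps (compute + comm). With more chunks,
    sending piece i overlaps with computing piece i + 1.
    """
    c, m = compute / chunks, comm / chunks
    return c + m + (chunks - 1) * max(c, m)


# Hypothetical heterogeneous cluster: work units per second for each worker.
speeds = [4.0, 2.0, 1.0, 1.0]
total = 800.0

even = [total / len(speeds)] * len(speeds)   # static, equal-sized tasks
balanced = proportional_split(total, speeds)  # speed-aware tasks

t_even = bsp_iteration_time(even, speeds)          # slow workers set the pace
t_balanced = bsp_iteration_time(balanced, speeds)  # all workers finish together

t_whole = pipelined_time(6.0, 6.0, chunks=1)    # no overlap
t_chunked = pipelined_time(6.0, 6.0, chunks=3)  # partial overlap

print(t_even, t_balanced, t_whole, t_chunked)
```

In this toy setting, the equal split is bounded by the slowest workers, while the speed-aware split lets all workers reach the synchronization barrier at the same time. A real system would have to estimate speeds online and account for network contention, which this model ignores.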