Network-Aware Distributed Machine Learning Over Wide Area Network

Pan Zhou, Gang Sun, Hongfang Yu, Victor Chang
DOI: https://doi.org/10.1007/978-981-33-6141-6_6
2021-01-01
Abstract: Training a high-quality machine learning model requires access to the entire dataset. Because of data regulations and privacy concerns, datasets held in different data centers cannot be gathered into a single data center, so distributed training across multiple data centers is unavoidable. However, state-of-the-art distributed learning algorithms suffer from high communication cost because the wide area network connecting the data centers is slow and highly heterogeneous. In this paper, we propose a novel network-aware decentralized distributed training algorithm, NAD-PSGD, to overcome this problem. NAD-PSGD enables worker nodes to exchange information mainly over high-speed links and thus significantly reduces communication cost. In our experiments on Amazon cloud and a testbed cluster, NAD-PSGD reduces the training time to convergence by up to 42.8% and 66.9% compared with the advanced algorithms AD-PSGD and Allreduce-SGD, respectively.
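The abstract does not spell out how NAD-PSGD chooses links, so the following is only a minimal sketch of the general idea of preferring fast links in decentralized training, not the authors' algorithm. It assumes a hypothetical rule in which a worker picks its gossip partner with probability proportional to link bandwidth and then averages parameters with that partner. The names `pick_peer` and `gossip_average`, the worker labels, and the bandwidth figures are all invented for illustration.

```python
import random


def pick_peer(worker, bandwidth, rng):
    """Pick a peer with probability proportional to link bandwidth.

    This weighting rule is an assumption for illustration; the abstract
    does not state the selection rule used by NAD-PSGD.
    """
    peers = list(bandwidth[worker])
    weights = [bandwidth[worker][p] for p in peers]
    return rng.choices(peers, weights=weights, k=1)[0]


def gossip_average(params, a, b):
    """Replace both workers' parameters with their element-wise mean."""
    mean = [(x + y) / 2 for x, y in zip(params[a], params[b])]
    params[a] = list(mean)
    params[b] = list(mean)


# Hypothetical link speeds in Mbps between three workers.
bandwidth = {
    "A": {"B": 900.0, "C": 50.0},
    "B": {"A": 900.0, "C": 60.0},
    "C": {"A": 50.0, "B": 60.0},
}

# Count how often worker A would talk to each peer over 1000 rounds.
rng = random.Random(0)
counts = {"B": 0, "C": 0}
for _ in range(1000):
    counts[pick_peer("A", bandwidth, rng)] += 1

# One averaging step between A and B on toy parameter vectors.
params = {"A": [1.0, 3.0], "B": [3.0, 5.0], "C": [0.0, 0.0]}
gossip_average(params, "A", "B")
```

Under these assumed bandwidths, worker A exchanges with B far more often than with C, so most traffic travels over the fast link while the slow link is still used occasionally to keep all workers mixing.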