DGT: A Contribution-Aware Differential Gradient Transmission Mechanism for Distributed Machine Learning
Huaman Zhou, Zonghang Li, Qingqing Cai, Hongfang Yu, Shouxi Luo, Long Luo, Gang Sun
DOI: https://doi.org/10.1016/j.future.2021.03.006
2021-01-01
Abstract: Distributed machine learning is a mainstream approach for extracting insights for analytics and intelligence services in many domains (e.g., health, streaming, and business) from massive operational data. In such a system, multiple workers train over subsets of the data and collaboratively derive a global prediction/inference model by iteratively synchronizing their local learning results, e.g., the model gradients. This synchronization generates heavy, bursty traffic and results in high communication overhead in cluster networks. Such communication overhead has become the main bottleneck limiting the efficiency of distributed training of machine learning models. In this paper, our key observation is that the local gradients learned by workers may contribute differently to global model convergence, and that transmitting different gradients differentially can reduce the communication overhead and improve training efficiency. However, existing gradient transmission mechanisms treat all gradients the same, which may lead to long training times. Motivated by these observations, we propose Differential Gradient Transmission (DGT), a contribution-aware differential gradient transmission mechanism for efficient distributed learning, which transfers gradients with different transmission quality according to their contributions. In addition to designing a general architecture for DGT, we propose a novel algorithm and a novel protocol to facilitate fast model training. Experiments on a cluster with 6 GTX 1080 Ti GPUs and a 1 Gbps network show that DGT decreases model training time by 19.4% on GoogleNet, 34.4% on AlexNet, and 36.5% on VGG-11 compared to the default gradient transmission in MXNet. Its acceleration is better than that of two other related transmission solutions. In addition, DGT works well with different datasets (Fashion-MNIST, CIFAR-10), different data distributions (IID, non-IID), and different training algorithms (BSP, FedAvg).
(C) 2021 Elsevier B.V. All rights reserved.