Abstract: Distributed machine learning is a mainstream approach for deriving insights from massive operational data to support analytics and intelligence services in many domains (e.g., health, streaming and business). In such a system, multiple workers train over subsets of the data and collaboratively derive a global prediction/inference model by iteratively synchronizing their local learning results, e.g., the model gradients. This synchronization generates heavy and bursty traffic and results in high communication overhead in cluster networks. Such communication overhead has become the main bottleneck that limits the efficiency of distributed training of machine learning models. In this paper, our key observation is that local gradients learned by workers may contribute differently to global model convergence, and that transmitting different gradients differentially can reduce the communication overhead and improve training efficiency. However, existing gradient transmission mechanisms treat all gradients the same, which may lead to long training time. Motivated by our observations, we propose Differential Gradient Transmission (DGT), a contribution-aware differential gradient transmission mechanism for efficient distributed learning, which transfers gradients with different transmission quality according to their contributions. In addition to designing a general architecture for DGT, we propose a novel algorithm and a novel protocol to facilitate fast model training. Experiments on a cluster with 6 GTX 1080 Ti GPUs and a 1 Gbps network show that DGT decreases the model training time by 19.4% on GoogleNet, 34.4% on AlexNet and 36.5% on VGG-11 compared to the default gradient transmission on MXNet. Its acceleration is better than that of the other two related transmission solutions. Besides, DGT works well with different datasets (Fashion-MNIST, CIFAR-10), different data distributions (IID, non-IID) and different training algorithms (BSP, FedAvg). (C) 2021 Elsevier B.V. All rights reserved.
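
As a rough illustration of the idea described in the abstract, the sketch below splits a gradient into blocks, ranks the blocks by an assumed contribution proxy, and assigns the top fraction to a "reliable" group and the rest to a "best-effort" group. The proxy (mean absolute value per block), the block size, the fraction `p`, and the name `split_by_contribution` are all hypothetical. The abstract does not say how DGT measures contribution or how its protocol delivers each group, so this is not the authors' algorithm.

```python
import numpy as np

def split_by_contribution(grad, block_size=4, p=0.5):
    """Split a gradient into blocks and rank them by an assumed proxy.

    Returns the block matrix plus the indices of the blocks assigned to
    the (hypothetical) reliable and best-effort transmission groups.
    """
    flat = np.asarray(grad, dtype=np.float64).ravel()
    # Zero-pad so the length is a multiple of block_size.
    pad = (-len(flat)) % block_size
    padded = np.concatenate([flat, np.zeros(pad)])
    blocks = padded.reshape(-1, block_size)

    # Assumption: mean absolute value stands in for "contribution".
    scores = np.abs(blocks).mean(axis=1)

    # Keep at least one block in the reliable group.
    n_important = max(1, int(np.ceil(p * len(blocks))))
    order = np.argsort(-scores, kind="stable")  # highest score first
    important = sorted(order[:n_important].tolist())
    unimportant = sorted(order[n_important:].tolist())
    return blocks, important, unimportant

# Toy gradient: block 0 is large, block 1 is near zero, block 2 is moderate.
grad = np.array([0.9, -0.8, 0.7, 0.6,
                 0.01, 0.0, -0.02, 0.01,
                 0.5, 0.4, -0.3, 0.2])
blocks, reliable_ids, best_effort_ids = split_by_contribution(
    grad, block_size=4, p=0.5
)
```

In a real system the two index lists would be handed to different transport paths, for example one that retransmits lost packets and one that tolerates loss. That mapping is likewise an assumption here, not something stated in the abstract.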

Distributed Deep Neural Network Training with Important Gradient Filtering, Delayed Update and Static Filtering.

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training.

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

GRID: Gradient Routing with In-Network Aggregation for Distributed Training

Sparse Communication for Training Deep Networks

A Hierarchical Communication Algorithm for Distributed Deep Learning Training.

MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms.

MG-WFBP: Merging Gradients Wisely for Efficient Communication in Distributed Deep Learning

Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training

Toward Communication Efficient Adaptive Gradient Method

Sparse Gradient Compression for Distributed SGD

An Efficient Bandwidth-Adaptive Gradient Compression Algorithm for Distributed Training of Deep Neural Networks

Dynamic Delay Based Cyclic Gradient Update Method for Distributed Training.

A Partition Based Gradient Compression Algorithm for Distributed Training in AIoT

Distributed Newton Methods for Deep Neural Networks

DGT: A Contribution-Aware Differential Gradient Transmission Mechanism for Distributed Machine Learning

RedSync: Reducing Synchronization Bandwidth for Distributed Deep Learning Training System

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient

Peering Beyond the Gradient Veil with Distributed Auto Differentiation

An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems

Distributed High-Performance Computing Methods for Accelerating Deep Learning Training