RedSync: Reducing Synchronization Bandwidth for Distributed Deep Learning Training System

Jiarui Fang, Haohuan Fu, Guangwen Yang, Cho-Jui Hsieh
DOI: https://doi.org/10.1016/j.jpdc.2019.05.016
IF: 4.542
2019-01-01
Journal of Parallel and Distributed Computing
Abstract: Data parallelism has become a dominant method for scaling Deep Neural Network (DNN) training across multiple nodes. Because the bandwidth required to synchronize the gradients of the local models can be a bottleneck in large-scale distributed training, compressing communication traffic has recently gained widespread attention. Among several recently proposed compression algorithms, Residual Gradient Compression (RGC) is one of the most successful approaches: it can significantly compress the message each node transmits (to 0.1% of the gradient size) while still achieving the correct accuracy and the same convergence speed. However, the literature on compressing deep networks focuses almost exclusively on achieving a good theoretical compression rate, while the efficiency of RGC in real implementations has been less investigated. In this paper, we develop an RGC method that is able to reduce the end-to-end training time on real-world multi-GPU systems. Our proposed RGC system design, called RedSync, introduces a set of optimizations to reduce communication bandwidth while introducing limited overhead. We examine the performance of RedSync on two different multi-GPU platforms: 128 GPUs of a supercomputer and an 8-GPU server. Our test cases include image classification on Cifar10 and ImageNet, and language modeling on the Penn Treebank and Wiki2 datasets. For DNNs with a high communication-to-computation ratio, which have long been considered to scale poorly, RedSync shows significant performance improvement. (C) 2019 Elsevier Inc. All rights reserved.
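The abstract names RGC but does not describe its mechanics. As a rough illustration only, the NumPy sketch below assumes the commonly described form of residual gradient compression: each worker accumulates its gradient into a local residual, transmits only the largest-magnitude 0.1% of entries, and keeps the rest for later steps. The function name `rgc_step` and the parameter `ratio` are hypothetical and are not taken from RedSync.

```python
import numpy as np

def rgc_step(grad, residual, ratio=0.001):
    """One illustrative residual-gradient-compression step for a single worker.

    Assumes 1-D arrays; `ratio` is the fraction of entries transmitted.
    """
    # Accumulate the new gradient into the locally kept residual.
    acc = residual + grad
    # Number of entries to transmit: `ratio` of the gradient size, at least one.
    k = max(1, int(round(acc.size * ratio)))
    # Indices of the k largest-magnitude entries (in no particular order).
    idx = np.argpartition(np.abs(acc), acc.size - k)[acc.size - k:]
    values = acc[idx]
    # Transmitted entries are cleared; everything else stays as residual.
    new_residual = acc.copy()
    new_residual[idx] = 0.0
    return idx, values, new_residual

# Usage: a synthetic gradient of 10,000 entries, so 10 entries are sent.
rng = np.random.default_rng(0)
grad = rng.standard_normal(10_000)
residual = np.zeros_like(grad)
idx, values, residual = rgc_step(grad, residual)
```

In this sketch only `idx` and `values` would need to be communicated; a real system must also pay for the selection step itself, which is the kind of overhead the abstract says RedSync sets out to limit.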