Abstract: Communication overhead is the key challenge for distributed training. Gradient compression is widely used to reduce communication traffic, and when combined with a parallel communication mechanism such as pipelining, it can greatly alleviate the impact of communication overhead. However, two problems with gradient compression remain to be solved. First, gradient compression introduces extra computation cost, which delays the next training iteration. Second, gradient compression usually reduces convergence accuracy. In this paper, we combine a parallel mechanism with gradient quantization and periodic full-gradient compensation, and propose a new distributed optimization method named CP-SGD, which hides the overhead of gradient compression, overlaps part of the communication, and achieves high convergence accuracy. The local update operation in CP-SGD allows the next iteration to be launched quickly, without waiting for gradient compression and the current communication to complete. In addition, the accuracy loss caused by gradient compression is addressed by the k-step correction method introduced in CP-SGD, which provides a gradient correction every k iterations. We prove that CP-SGD has a convergence guarantee and achieves a convergence rate of at least $O(1/\sqrt{K} + 1/K)$, where K is the number of iterations. We conduct extensive experiments on MXNet to verify the convergence properties and scaling performance of CP-SGD. Experimental results on a 32-GPU cluster show that the convergence accuracy of CP-SGD is close to or even slightly better than that of S-SGD, and that its end-to-end time is 30% less than that of 2-bit gradient compression in a 56 Gbps bandwidth environment. In addition, we analyze the performance of CP-SGD when training on 8, 16 and 32 GPUs. We find that CP-SGD is suitable for most compression-supported update algorithms and that its scalability is approximately linear.
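
To make the interplay between gradient quantization and periodic correction concrete, the following is a minimal single-worker sketch in plain Python. It is not the CP-SGD algorithm as specified in the paper: the abstract does not give the update rules, so the threshold-based 2-bit quantizer with residual accumulation, the choice to send the full gradient plus the accumulated residual every k iterations, and all names (`quantize_2bit`, `compressed_sgd_with_correction`, `threshold`) are assumptions made for illustration. The local update, pipelining, and multi-worker aggregation described above are omitted.

```python
def quantize_2bit(grad, residual, threshold):
    """Assumed threshold quantizer: each entry becomes +threshold, -threshold or 0.

    The quantization error is stored in `residual` (updated in place) so it
    can be added back to later gradients instead of being lost.
    """
    quantized = []
    for i, g in enumerate(grad):
        acc = residual[i] + g          # gradient plus carried-over error
        if acc >= threshold:
            q = threshold
        elif acc <= -threshold:
            q = -threshold
        else:
            q = 0.0
        residual[i] = acc - q          # keep what was not transmitted
        quantized.append(q)
    return quantized


def compressed_sgd_with_correction(grad_fn, w, lr, k, steps, threshold):
    """Hypothetical loop: quantized updates, with a full-precision step every k iterations."""
    w = list(w)
    residual = [0.0] * len(w)
    for t in range(1, steps + 1):
        grad = grad_fn(w)
        if t % k == 0:
            # Assumed correction step: apply the uncompressed gradient together
            # with the accumulated residual, then clear the residual.
            update = [g + r for g, r in zip(grad, residual)]
            residual = [0.0] * len(w)
        else:
            update = quantize_2bit(grad, residual, threshold)
        w = [wi - lr * ui for wi, ui in zip(w, update)]
    return w


def quadratic_grad(w):
    """Gradient of the toy objective f(w) = 0.5 * sum(w_i ** 2)."""
    return list(w)
```

Calling `compressed_sgd_with_correction(quadratic_grad, [2.0, -3.0], lr=0.1, k=4, steps=200, threshold=0.5)` drives both coordinates of this toy objective toward zero. Setting `k=1` makes every step a correction step, which reduces the loop to ordinary SGD and gives a convenient baseline for comparison.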

Gradient Compression Supercharged High-Performance Data Parallel DNN Training

A Generic, High-Performance, Compression-Aware Framework for Data Parallel DNN Training

Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training

Compressed Communication for Distributed Training: Adaptive Methods and System

Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

An Efficient Bandwidth-Adaptive Gradient Compression Algorithm for Distributed Training of Deep Neural Networks

CP-SGD: Distributed Stochastic Gradient Descent with Compression and Periodic Compensation

DAGC: Data-Aware Adaptive Gradient Compression

An efficient statistical-based gradient compression technique for distributed training systems

Near-Linear Scaling Data Parallel Training with Overlapping-Aware Gradient Compression

Sparse Gradient Compression for Distributed SGD

GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training

Communication Efficient SGD via Gradient Sampling with Bayes Prior

SK-Gradient: Efficient Communication for Distributed Machine Learning with Data Sketch

RedSync: Reducing Synchronization Bandwidth for Distributed Deep Learning Training System

RedSync: Reducing Synchronization Traffic for Distributed Deep Learning

PipeCompress: Accelerating Pipelined Communication for Distributed Deep Learning