Abstract: Communication overhead is the key challenge in distributed training. Gradient compression is a widely used approach to reducing communication traffic. When combined with a parallel communication mechanism such as pipelining, gradient compression can greatly alleviate the impact of communication overhead. However, two problems with gradient compression remain to be solved. First, gradient compression introduces extra computation cost, which delays the next training iteration. Second, gradient compression usually decreases convergence accuracy. In this paper, we combine a parallel mechanism with gradient quantization and periodic full-gradient compensation, and propose a new distributed optimization method named CP-SGD, which hides the overhead of gradient compression, overlaps part of the communication, and obtains high convergence accuracy. The local update operation in CP-SGD allows the next iteration to be launched quickly, without waiting for gradient compression and the current communication process to complete. Furthermore, the accuracy loss caused by gradient compression is addressed by the k-step correction method introduced in CP-SGD, which provides a gradient correction every k iterations. We prove that CP-SGD has a convergence guarantee and achieves a convergence rate of at least O(1/√K + 1/K), where K is the number of iterations. We conduct extensive experiments on MXNet to verify the convergence properties and scaling performance of CP-SGD. Experimental results on a 32-GPU cluster show that the convergence accuracy of CP-SGD is close to or even slightly better than that of S-SGD, and that its end-to-end time is 30% less than that of 2-bit gradient compression in a 56 Gbps bandwidth environment. In addition, we analyze the performance of CP-SGD when training on 8, 16, and 32 GPUs. We find that CP-SGD is suitable for most compression-supported update algorithms and that its scalability is approximately linear.
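To make the idea of a gradient correction every k iterations concrete, the following is a minimal single-process sketch and not the authors' implementation. It assumes a threshold-style 2-bit quantizer with a residual (error accumulation) and assumes that the correction step simply applies the uncompressed gradient. The function names `quantize_2bit` and `train`, the threshold, the learning rate, and the quadratic objective are all hypothetical choices for illustration. The local update and communication overlap described in the abstract are not modeled here.

```python
import numpy as np

def quantize_2bit(grad, residual, threshold):
    """Threshold quantizer (assumed form): each entry becomes
    +threshold, -threshold, or 0. The part that is not sent is kept
    in the residual and added to the next gradient."""
    acc = grad + residual
    q = np.zeros_like(acc)
    q[acc >= threshold] = threshold
    q[acc <= -threshold] = -threshold
    return q, acc - q

def train(grad_fn, w, steps, k, lr=0.1, threshold=0.5):
    """SGD with quantized gradients and a full-gradient step every k
    iterations (an assumed reading of the k-step correction)."""
    w = w.copy()
    residual = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        if t % k == 0:
            update = g  # correction step: uncompressed gradient
        else:
            update, residual = quantize_2bit(g, residual, threshold)
        w -= lr * update
    return w

# Toy objective (hypothetical): f(w) = 0.5 * ||w - target||^2
target = np.array([0.4, -0.3, 0.2])
w_final = train(lambda w: w - target, np.zeros(3), steps=200, k=5)
print(np.abs(w_final - target).max())
```

In this toy setting the quantized steps alone only bring the iterate to within roughly one quantization step of the optimum, while the periodic full-gradient steps pull it closer; the actual benefit in CP-SGD depends on details that the abstract does not specify.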

CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation

CP-SGD: Distributed Stochastic Gradient Descent with Compression and Periodic Compensation

DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

Sparse Gradient Compression for Distributed SGD

Compressed Communication for Distributed Training: Adaptive Methods and System

AC-SGD: Adaptively Compressed SGD for Communication-Efficient Distributed Learning

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Step-Ahead Error Feedback for Distributed Training with Compressed Gradient

Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

Communication Efficient SGD via Gradient Sampling with Bayes Prior

Sparse Communication for Training Deep Networks

Efficient Distributed Stochastic Gradient Descent Through Gaussian Averaging

SGC: Similarity-Guided Gradient Compression for Distributed Deep Learning

Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training

SK-Gradient: Efficient Communication for Distributed Machine Learning with Data Sketch

DAGC: Data-Aware Adaptive Gradient Compression

Error Compensated Distributed SGD Can Be Accelerated

An efficient statistical-based gradient compression technique for distributed training systems