Abstract: Communication overhead is the key challenge in distributed training. Gradient compression is widely used to reduce communication traffic, and when combined with a parallel communication mechanism such as pipelining, it can greatly alleviate the impact of that overhead. However, gradient compression has two unsolved problems. First, it introduces extra computation cost, which delays the next training iteration. Second, it usually lowers convergence accuracy. In this paper, we combine a parallel mechanism with gradient quantization and periodic full-gradient compensation and propose a new distributed optimization method, CP-SGD, which hides the overhead of gradient compression, overlaps part of the communication, and achieves high convergence accuracy. The local update operation in CP-SGD lets the next iteration start quickly, without waiting for gradient compression and the current communication to complete. The accuracy loss caused by gradient compression is addressed by the k-step correction method introduced in CP-SGD, which provides a gradient correction every k iterations. We prove that CP-SGD has a convergence guarantee and achieves a convergence rate of at least $O(1/\sqrt{K} + 1/K)$, where K is the number of iterations. We conduct extensive experiments on MXNet to verify the convergence properties and scaling performance of CP-SGD. Experimental results on a 32-GPU cluster show that the convergence accuracy of CP-SGD is close to, or even slightly better than, that of S-SGD, and that its end-to-end time is 30% less than that of 2-bit gradient compression in a 56 Gbps bandwidth environment. In addition, we analyze the performance of CP-SGD when training on 8, 16, and 32 GPUs. We find that CP-SGD is suitable for most compression-supported update algorithms and that its scalability is approximately linear.
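The interplay between gradient quantization and the k-step correction can be illustrated with a small sketch. The abstract does not give the exact update rule, so everything below is an assumption made for illustration: the threshold quantizer with a residual (error-feedback) buffer, the choice to send the full-precision gradient plus residual on every k-th step, and the names `quantize_2bit` and `communicated_gradient` are all hypothetical. The local update and the overlapped communication pipeline are omitted.

```python
from typing import List, Tuple


def quantize_2bit(
    grad: List[float], residual: List[float], threshold: float
) -> Tuple[List[float], List[float]]:
    """Assumed threshold quantizer with a residual buffer.

    Each entry of (grad + residual) is mapped to +threshold, -threshold,
    or 0. Whatever is not transmitted is kept in the new residual.
    """
    quantized, new_residual = [], []
    for g, r in zip(grad, residual):
        v = g + r
        if v >= threshold:
            q = threshold
        elif v <= -threshold:
            q = -threshold
        else:
            q = 0.0
        quantized.append(q)
        new_residual.append(v - q)
    return quantized, new_residual


def communicated_gradient(
    step: int, k: int, grad: List[float], residual: List[float], threshold: float
) -> Tuple[List[float], List[float]]:
    """Return (values to send, updated residual) for one iteration.

    Assumption: on every k-th step the full-precision gradient plus the
    accumulated residual is sent and the residual is cleared; on all
    other steps the quantized values are sent.
    """
    if step % k == 0:
        full = [g + r for g, r in zip(grad, residual)]
        return full, [0.0] * len(grad)
    return quantize_2bit(grad, residual, threshold)


if __name__ == "__main__":
    residual = [0.0, 0.0, 0.0]
    grad = [0.75, -0.25, 0.125]
    for step in range(1, 5):
        sent, residual = communicated_gradient(step, 4, grad, residual, 0.5)
        print(step, sent, residual)
```

In this sketch nothing is lost permanently: on a quantized step the sent values plus the new residual equal the gradient plus the old residual, and the periodic full-precision step flushes whatever error has accumulated.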

A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication

Efficient Distributed Stochastic Gradient Descent Through Gaussian Averaging

A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

Sparse Communication for Training Deep Networks

On the Convergence of Quantized Parallel Restarted SGD for Central Server Free Distributed Training

Quantized Epoch-SGD for Communication-Efficient Distributed Learning

DQSGD: Dynamic Quantized Stochastic Gradient Descent for Communication-Efficient Distributed Learning

Decentralized SGD with Asynchronous, Local and Quantized Updates

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient

Error compensated quantized SGD and its applications to large-scale distributed optimization

Sparse Gradient Compression for Distributed SGD

Lazily Aggregated Quantized Gradient Innovation for Communication-Efficient Federated Learning

Communication-Efficient Distributed Deep Learning with Merged Gradient Sparsification on GPUs

STL-SGD: Speeding Up Local SGD with Stagewise Communication Period

Global-QSGD: Practical Floatless Quantization for Distributed Learning with Theoretical Guarantees

CP-SGD: Distributed Stochastic Gradient Descent with Compression and Periodic Compensation

DQ-SGD: Dynamic Quantization in SGD for Communication-Efficient Distributed Learning

A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration

Truncated Non-Uniform Quantization for Distributed SGD

Convergence Theory of Generalized Distributed Subgradient Method with Random Quantization