Abstract: Communication overhead is the key challenge for distributed training. Gradient compression is a widely used approach to reducing communication traffic. When combined with a parallel communication mechanism such as pipelining, gradient compression can greatly alleviate the impact of communication overhead. However, two problems with gradient compression remain to be solved. First, gradient compression introduces extra computation cost, which delays the next training iteration. Second, gradient compression usually reduces convergence accuracy. In this paper, we combine a parallel mechanism with gradient quantization and periodic full-gradient compensation, and propose a new distributed optimization method named CP-SGD, which hides the overhead of gradient compression, overlaps part of the communication, and obtains high convergence accuracy. The local update operation in CP-SGD allows the next iteration to be launched quickly, without waiting for gradient compression and the current communication process to complete. In addition, the accuracy loss caused by gradient compression is addressed by the k-step correction method introduced in CP-SGD, which provides a gradient correction every k iterations. We prove that CP-SGD has a convergence guarantee and achieves a convergence rate of at least $O(1/\sqrt{K} + 1/K)$, where K is the number of iterations. We conduct extensive experiments on MXNet to verify the convergence properties and scaling performance of CP-SGD. Experimental results on a 32-GPU cluster show that the convergence accuracy of CP-SGD is close to or even slightly better than that of S-SGD, and that its end-to-end time is 30% less than that of 2-bit gradient compression in a 56 Gbps bandwidth environment. We also analyze the performance of CP-SGD when training on 8, 16, and 32 GPUs, and find that CP-SGD is suitable for most compression-supported update algorithms and that its scalability is approximately linear.
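The abstract names two ingredients, gradient quantization and a correction every k iterations, without giving their exact form. The sketch below is one possible reading, not the authors' algorithm. It assumes a threshold quantizer with a residual buffer (in the style of 2-bit compression, where each value is sent as +threshold, -threshold, or 0) and it assumes that the correction step sends the full-precision accumulated value and clears the residual. The names `quantize_2bit`, `message_for_step`, `threshold`, and `residual` are hypothetical, and the local update and pipelined communication described in the abstract are not modeled.

```python
def quantize_2bit(grad, residual, threshold):
    """Threshold quantization with a residual buffer (assumed scheme).

    Each coordinate of grad + residual is mapped to +threshold, -threshold,
    or 0; whatever was not sent is kept in the new residual.
    """
    quantized, new_residual = [], []
    for g, r in zip(grad, residual):
        v = g + r
        if v >= threshold:
            q = threshold
        elif v <= -threshold:
            q = -threshold
        else:
            q = 0.0
        quantized.append(q)
        new_residual.append(v - q)
    return quantized, new_residual


def message_for_step(step, grad, residual, threshold, k):
    """Return (message, new_residual) for one worker at iteration `step`.

    Assumption: every k-th iteration is a correction step that sends the
    full-precision value grad + residual and resets the residual to zero.
    """
    if (step + 1) % k == 0:
        full = [g + r for g, r in zip(grad, residual)]
        return full, [0.0] * len(grad)
    return quantize_2bit(grad, residual, threshold)


if __name__ == "__main__":
    residual = [0.0, 0.0, 0.0]
    grad = [0.7, -0.2, -0.9]  # toy gradient, reused at every step
    for step in range(4):
        msg, residual = message_for_step(step, grad, residual, threshold=0.5, k=4)
        print(step, msg)
```

Under these assumptions, each step satisfies message + new residual = gradient + old residual, so after a correction step the messages sent so far sum to the gradients computed so far. This is the sense in which a periodic full-precision step can compensate for quantization error in the sketch.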

Error Compensated Loopless SVRG, Quartz, and SDCA for Distributed Optimization

DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

ErrorCompensatedX: error compensation for variance reduced algorithms

Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization

Error Compensated Distributed SGD Can Be Accelerated

Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for Strongly Convex Decentralized Optimization

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

CP-SGD: Distributed Stochastic Gradient Descent with Compression and Periodic Compensation

Optimal Accelerated Variance Reduced EXTRA and DIGing for Strongly Convex and Smooth Decentralized Optimization

Distributed Learning with Convex SUM-of-Non-convex Objective

Sparse Gradient Compression for Distributed SGD

AC-SGD: Adaptively Compressed SGD for Communication-Efficient Distributed Learning

Communication-Efficient Distributed Learning with Local Immediate Error Compensation

EControl: Fast Distributed Optimization with Compression and Error Control

Variance-reduced Reshuffling Gradient Descent for Nonconvex Optimization: Centralized and Distributed Algorithms

Convergence Bounds for Compressed Gradient Methods with Memory Based Error Compensation

Distributed learning with compressed gradient differences

Compressed Gradient Tracking Algorithms for Distributed Nonconvex Optimization

Compressed Gradient Methods With Hessian-Aided Error Compensation

Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization