Sparse Gradient Compression for Distributed SGD

Haobo Sun, Yingxia Shao, Jiawei Jiang, Bin Cui, Kai Lei, Yu Xu, Jiang Wang
DOI: https://doi.org/10.1007/978-3-030-18579-4_9
2019-01-01
Abstract: Communication bandwidth is a bottleneck in distributed machine learning and limits system scalability. The transmission of gradients often dominates communication in distributed SGD. One promising technique is to use gradient compression to reduce the communication cost. Recently, many approaches have been developed for deep neural networks. However, they still suffer from high memory cost, slow convergence, and serious staleness problems on sparse high-dimensional models. In this work, we propose Sparse Gradient Compression (SGC) to efficiently train both sparse models and deep neural networks. SGC uses momentum approximation to reduce the memory cost with negligible accuracy degradation. It then improves accuracy with long-term gradient compensation, which maintains a global momentum to make up for the information loss caused by the approximation. Finally, to alleviate the staleness problem, SGC updates the model weights locally with the accumulation of delayed gradients, a technique called local update. Experiments on sparse high-dimensional models and deep neural networks indicate that SGC can compress 99.99% of the gradients in every iteration without performance degradation, and reduces the communication cost by up to 48x.
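The abstract's idea of sending only a small fraction of the gradient while accumulating delayed gradients locally can be illustrated with a minimal sketch. This is generic top-k sparsification with a local residual, not the SGC algorithm itself: momentum approximation and long-term gradient compensation are not modeled, and the function name `sparsify_top_k` and the parameter `keep_ratio` are hypothetical names chosen for illustration.

```python
def sparsify_top_k(gradient, residual, keep_ratio):
    """Return (sparse_update, new_residual).

    Illustrative only: a generic top-k scheme, not the paper's SGC method.
    sparse_update maps index -> value for the entries that would be sent;
    all other entries stay in new_residual for later iterations.
    """
    # Add the new gradient to what was held back in earlier iterations.
    accumulated = [r + g for r, g in zip(residual, gradient)]
    # Keep at least one entry so that every iteration sends something.
    k = max(1, int(len(accumulated) * keep_ratio))
    # Indices of the k entries with the largest magnitude.
    top = sorted(range(len(accumulated)),
                 key=lambda i: abs(accumulated[i]),
                 reverse=True)[:k]
    sparse_update = {i: accumulated[i] for i in top}
    # Sent entries are cleared; the rest is carried over as delayed gradient.
    new_residual = [0.0 if i in sparse_update else v
                    for i, v in enumerate(accumulated)]
    return sparse_update, new_residual


residual = [0.0] * 5
grad = [0.1, -2.0, 0.3, 0.05, 1.5]
update, residual = sparsify_top_k(grad, residual, keep_ratio=0.2)
print(update)    # {1: -2.0}
print(residual)  # [0.1, 0.0, 0.3, 0.05, 1.5]
```

With `keep_ratio=0.2` only one of the five entries is sent per call; the held-back value at index 4 would be sent on a later call once it is the largest remaining entry. A compression rate of 99.99%, as reported in the abstract, would correspond to `keep_ratio=0.0001` in this sketch.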