Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training

Xue Li, Cheng Guo, Kun Qian, Menghao Zhang, Mengyu Yang, Mingwei Xu
DOI: https://doi.org/10.1145/3698038.3698541
2024-01-01
Abstract: Data parallelism has become a cornerstone in scaling up the training of deep neural networks (DNNs). However, the communication overhead associated with synchronizing gradients across multiple nodes has emerged as a significant bottleneck, adversely affecting training efficiency and leading to a surge in large-scale distributed model training costs. By leveraging insights into the statistical characteristics of gradients, we present GComp, a near-lossless gradient compression scheme designed to reduce the communication burden during data-parallel training significantly. GComp develops an optimized Huffman encoding/decoding strategy to compress gradient exponents effectively. Additionally, it introduces an innovative multi-level quantization method for mantissa, complemented by a pruning strategy that eliminates zero-valued gradients. These integrated approaches significantly reduce the volume of data for synchronization, while virtually not affecting the DNN model's training accuracy. We conduct comprehensive evaluations of GComp, demonstrating that our method can decrease the communication volume by as much as 67.1%, and enhance training speed by up to 1.9×.
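The idea of Huffman-coding gradient exponents after pruning zeros can be illustrated with a small sketch. This is not GComp's implementation, which the abstract does not detail. It assumes gradients are stored as IEEE 754 float32 values (an assumption; the abstract does not name the format), uses a textbook Huffman construction rather than the paper's optimized one, and leaves out mantissa quantization entirely. The function names `float32_fields`, `huffman_code_lengths`, and `estimate_exponent_bits` are hypothetical.

```python
import heapq
import struct
from collections import Counter


def float32_fields(x):
    """Round x to float32 and split it into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a textbook Huffman code."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap entries are (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def estimate_exponent_bits(gradients):
    """Return (Huffman-coded exponent bits, raw exponent bits) after dropping zeros."""
    exponents = []
    for g in gradients:
        _, exponent, mantissa = float32_fields(g)
        if exponent == 0 and mantissa == 0:
            continue  # zero-valued gradient: pruned, contributes no exponent
        exponents.append(exponent)
    freqs = Counter(exponents)
    lengths = huffman_code_lengths(freqs)
    coded = sum(count * lengths[e] for e, count in freqs.items())
    return coded, 8 * len(exponents)
```

Calling `estimate_exponent_bits` on a list of gradient values returns the number of bits the exponents would occupy under this simple Huffman code next to the 8 bits per value they occupy uncompressed. Any saving depends on how concentrated the exponent distribution is, and a real scheme would also need to transmit the code table and the positions of pruned zeros, which this sketch ignores.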