BinSGDM: Extreme One-Bit Quantization for Communication Efficient Large-Scale Distributed Training

Hanyang Peng, Shuang Qin, Yue Yu, Jin Wang, Hui Wang, Ge Li
2023-01-01
Abstract: To alleviate the communication bottleneck of large-scale distributed training, a rich body of communication-compression optimizers has been proposed. These methods focus mainly on achieving a high compression ratio in the expectation of acceleration. However, recent works have pointed out that, when run with distributed training frameworks (\emph{e.g.}, \emph{DistributedDataParallel} in PyTorch), these methods may provide no acceleration over off-the-shelf uncompressed SGD/Adam in typical settings, owing to heavy compression/decompression computation, incompatibility with efficient communication primitives, or the requirement of an uncompressed warmup at the early stage. For these reasons, we propose a novel extreme one-bit quantization optimizer, dubbed \emph{BinSGDM}. The quantization in \emph{BinSGDM} is simple and lightweight to compute, and it does not need to resort to uncompressed optimizers for warmup. We also prove theoretically that it can attain the same convergence speed as the original Adam. Moreover, we present a hierarchical communication scheme to further lower the communication volume. Extensive experiments are conducted on 8 to 64 GPUs (1 to 8 nodes) for distributed training with \emph{DistributedDataParallel}, and the experimental results demonstrate that \emph{BinSGDM} with the communication scheme can achieve up to $\bm{2.47\times}$ speedup for training ResNet-50 and $\bm{6.26\times}$ speedup for training BERT-Base, compared to the full-precision optimizers.
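To make the communication saving behind "one-bit quantization" concrete, the following is a minimal sketch of generic sign quantization with a single shared scale. It is not BinSGDM's actual quantizer, which the abstract does not specify; the function names `one_bit_pack` and `one_bit_unpack`, the mean-absolute-value scale, and the random test vector are all assumptions made for illustration.

```python
import numpy as np

def one_bit_pack(x: np.ndarray):
    """Generic sign quantization (illustrative, not the paper's method):
    keep one sign bit per element plus one shared float scale."""
    flat = x.astype(np.float32).ravel()
    scale = float(np.mean(np.abs(flat)))   # assumed scale choice: mean magnitude
    bits = np.packbits(flat >= 0)          # 8 sign bits packed into each byte
    return bits, scale, flat.size

def one_bit_unpack(bits: np.ndarray, scale: float, n: int) -> np.ndarray:
    """Map each stored bit back to +scale or -scale."""
    signs = np.unpackbits(bits, count=n).astype(np.float32) * 2.0 - 1.0
    return signs * scale

# Hypothetical stand-in for a gradient or update vector.
rng = np.random.default_rng(0)
grad = rng.standard_normal(1024).astype(np.float32)

bits, scale, n = one_bit_pack(grad)
restored = one_bit_unpack(bits, scale, n)

# 1024 float32 values occupy 4096 bytes; their sign bits occupy 128 bytes.
ratio = grad.nbytes / bits.nbytes
print(f"payload: {grad.nbytes} B -> {bits.nbytes} B ({ratio:.0f}x smaller)")
```

Under these assumptions the payload shrinks by a factor of 32 relative to 32-bit floats, ignoring the single scale value. The end-to-end speedups reported in the abstract are smaller than this ratio, since training time also includes computation and communication overheads.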