bbTopk: Bandwidth-Aware Sparse Allreduce with Blocked Sparsification for Efficient Distributed Training

Chang Chen, Min Li, Chao Yang
DOI: https://doi.org/10.1109/icdcs57875.2023.00015
2023-01-01
Abstract: Communication overhead is one of the major bottlenecks for large-scale distributed model training. Sparse gradients have been proposed to reduce the communication volume dramatically without affecting model accuracy. However, high-performance implementation of sparse-gradient training is still hindered by the overhead of gradient sparsification and by inefficient implementations of sparse allreduce. In a sparse allreduce operation, the density of intermediate results may increase dynamically as summations progress. High-performance sparse allreduce is further complicated by heterogeneous bandwidths. To tackle these challenges, we propose bbTopk for efficient sparse gradient training, which includes a new blocked top-$k$ sparsification technique and a novel bandwidth-aware sparse allreduce algorithm. In particular, to alleviate the sparsification overhead, we design a blocked top-$k$ method that reduces the overhead without sacrificing model accuracy. We then build a heterogeneity-aware communication model that accounts for the dynamic workload of sparse allreduce. Guided by this model, we propose a new sparse allreduce algorithm that takes advantage of the network resources by improving bandwidth utilization, reducing cross-node hops, and adjusting the round order. Experiments are conducted on a variety of typical neural networks and distributed environments. Results show that bbTopk substantially outperforms the previous state-of-the-art work in most test cases, with up to 2.57x speedup, while empirically achieving accuracy similar to that of the dense model.
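The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not say how blocks are formed or how $k$ is split across them, so fixed-size contiguous blocks with the same number of kept entries per block are assumptions, and the names `blocked_topk`, `sparse_add`, and `density` are hypothetical. The sketch shows a per-block top-$k$ selection and how the support of a running sparse sum can grow as more workers' contributions are added.

```python
import heapq
import random


def blocked_topk(grad, block_size, k_per_block):
    """Keep the k largest-magnitude entries inside each contiguous block.

    Returns a dict {index: value}. The per-block rule is an assumption
    made for illustration, not the paper's exact method.
    """
    sparse = {}
    for start in range(0, len(grad), block_size):
        block = range(start, min(start + block_size, len(grad)))
        # Selection runs over one short block at a time instead of
        # over the whole gradient.
        for i in heapq.nlargest(k_per_block, block, key=lambda j: abs(grad[j])):
            sparse[i] = grad[i]
    return sparse


def sparse_add(a, b):
    """Sum two sparse vectors; the result's support is the union of both."""
    out = dict(a)
    for i, v in b.items():
        out[i] = out.get(i, 0.0) + v
    return out


def density(sparse, length):
    """Fraction of positions that are stored in the sparse vector."""
    return len(sparse) / length


if __name__ == "__main__":
    rng = random.Random(0)
    n, workers = 1024, 8
    partial = {}
    for w in range(workers):
        grad = [rng.gauss(0.0, 1.0) for _ in range(n)]
        local = blocked_topk(grad, block_size=64, k_per_block=1)
        partial = sparse_add(partial, local)
        # Density of the running sum after w + 1 workers have contributed.
        print(w + 1, density(partial, n))
```

Because different workers generally select different indices, the running sum stores the union of their supports, so its density never decreases and typically rises with each summation step. This is the dynamic workload that a sparse allreduce has to account for.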