BIRD: A Lightweight and Adaptive Compressor for Communication-Efficient Distributed Learning Using Tensor-wise Bi-Random Sampling

Donglei Wu, Weihao Yang, Cai Deng, Xiangyu Zou, Shiyi Li, Wen Xia
DOI: https://doi.org/10.1109/iccd58817.2023.00096
2023-01-01
Abstract: Top-K sparsification-based compression frameworks are widely employed to reduce communication costs in distributed learning. However, we have identified several issues with existing Top-K sparsification-based compression methods that severely impede their deployment on resource-constrained devices: (i) the limited compressibility of the Top-K parameters' indexes, which critically restricts the overall communication compression ratio; (ii) several time-consuming compression operations, which significantly negate the benefits of communication compression; and (iii) the high memory footprint of the error feedback techniques used to maintain model quality. To address these issues, we propose BIRD, a lightweight tensor-wise Bi-Random sampling strategy with an expectation invariance property, which achieves higher compression ratios at lower computational overhead while maintaining comparable model quality without additional memory costs. Specifically, BIRD applies a tensor-wise index sharing mechanism that substantially reduces the proportion of indexes in the compressed data by allowing multiple tensor elements to share a single index, thus improving the overall compression ratio. Additionally, BIRD replaces the time-consuming Top-K sorting with a faster Bi-Random sampling strategy based on this index sharing mechanism, thereby reducing the computational cost of compression. Moreover, BIRD builds an expectation invariance property into the Bi-Random sampling to ensure an unbiased representation of the L1-norm of the sampled tensors, effectively maintaining model quality without incurring extra memory costs. Experiments on multiple mainstream machine learning (ML) tasks demonstrate that, compared to state-of-the-art methods, BIRD achieves a 1.3×-31.1× higher compression ratio at lower time overhead with O(N) complexity while maintaining model quality without incurring extra memory costs.
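The abstract does not spell out the sampling procedure, so the following is only a minimal illustrative sketch of the two general ideas it names: several elements sharing one transmitted index, and rescaling so that the L1-norm is preserved in expectation. It is not the authors' BIRD algorithm. The circular-block scheme, the n/k scaling factor, and the names `sample_block`, `compress`, `decompress`, and `l1` are all assumptions made for illustration.

```python
import random

def sample_block(values, k, start):
    """Take k consecutive elements (wrapping around) beginning at `start`.

    Assumption for illustration: all k elements share the single index
    `start`, and each kept value is scaled by n/k so that the
    reconstruction is unbiased when `start` is uniform over 0..n-1.
    """
    n = len(values)
    scale = n / k
    return [values[(start + j) % n] * scale for j in range(k)]

def compress(values, k, rng=random):
    """Pick one uniform random start; only (start, payload) would be sent."""
    start = rng.randrange(len(values))
    return start, sample_block(values, k, start)

def decompress(start, payload, n):
    """Scatter the payload back into a zero-filled vector of length n."""
    out = [0.0] * n
    for j, v in enumerate(payload):
        out[(start + j) % n] = v
    return out

def l1(values):
    """L1-norm of a flat list of numbers."""
    return sum(abs(v) for v in values)
```

In this sketch a single pass over the k sampled elements replaces any sorting, and one integer index covers the whole block. Averaging the L1-norm of the reconstruction over every possible start position recovers the L1-norm of the original vector, which is the sense in which the scaling is unbiased here.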