Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

Zhenheng Tang, Shaohuai Shi, Wei Wang, Bo Li, Xiaowen Chu
2023-09-01
Abstract: Distributed deep learning (DL) has become prevalent in recent years to reduce training time by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and datasets. However, system scalability is limited by communication becoming the performance bottleneck. Addressing this communication issue has become a prominent research topic. In this paper, we provide a comprehensive survey of the communication-efficient distributed training algorithms, focusing on both system-level and algorithmic-level optimizations. We first propose a taxonomy of data-parallel distributed training algorithms that incorporates four primary dimensions: communication synchronization, system architectures, compression techniques, and parallelism of communication and computing tasks. We then investigate state-of-the-art studies that address problems in these four dimensions. We also compare the convergence rates of different algorithms to understand their convergence speed. Additionally, we conduct extensive experiments to empirically compare the convergence performance of various mainstream distributed training algorithms. Based on our system-level communication cost analysis, theoretical and experimental convergence speed comparison, we provide readers with an understanding of which algorithms are more efficient under specific distributed environments. Our research also extrapolates potential directions for further optimizations.
Distributed, Parallel, and Cluster Computing, Machine Learning, Signal Processing
What problem does this paper attempt to address?
The paper aims to address the issue of communication efficiency in distributed deep learning (DL). As models and datasets continue to grow, training becomes extremely time-consuming and computationally intensive. Distributed training is an effective way to accelerate it, but the accompanying communication cost becomes a bottleneck for system scalability. The main objectives of the paper are:

1. **Comprehensive Review**: Provide a comprehensive review of communication-efficient data-parallel distributed deep learning algorithms, covering optimization methods from the system level to the algorithm level.
2. **Classification Framework**: Propose a classification framework that organizes data-parallel distributed training algorithms along four main dimensions: communication synchronization, system architecture, compression techniques, and the parallelism of communication and computation tasks.
3. **Algorithm Comparison**: Conduct a theoretical analysis of the convergence speed of different algorithms and compare the convergence performance of various mainstream distributed training algorithms through extensive experiments.
4. **Experimental Validation**: Based on system-level communication cost analysis and theoretical and experimental convergence speed comparisons, help readers understand which algorithms are more efficient in specific distributed environments.
5. **Future Directions**: Explore potential directions for further optimization.

Through this work, the paper hopes to give researchers and engineers a comprehensive understanding that inspires them to develop new, efficient distributed training algorithms and frameworks.
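To make the "compression techniques" dimension more concrete, here is a minimal sketch of one commonly studied approach, top-k gradient sparsification with a local residual. The summary above does not say which compression methods the paper covers in detail, so this is an illustration under that assumption and not the authors' implementation. The function names `top_k_sparsify` and `compress_with_residual` are hypothetical, and plain Python lists stand in for gradient tensors.

```python
from typing import List, Tuple


def top_k_sparsify(grad: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Return the indices and values of the k largest-magnitude entries."""
    k = max(0, min(k, len(grad)))
    # Sort indices by absolute value, largest first.
    # sorted() is stable, so ties keep the lower index first.
    order = sorted(range(len(grad)), key=lambda i: -abs(grad[i]))
    idx = sorted(order[:k])
    return idx, [grad[i] for i in idx]


def compress_with_residual(
    grad: List[float], residual: List[float], k: int
) -> Tuple[List[int], List[float], List[float]]:
    """Add the carried-over residual, select the top-k entries to send,
    and keep everything that was not sent as the new local residual."""
    corrected = [g + r for g, r in zip(grad, residual)]
    idx, vals = top_k_sparsify(corrected, k)
    new_residual = list(corrected)
    for i in idx:
        new_residual[i] = 0.0  # sent entries no longer need to be carried over
    return idx, vals, new_residual


if __name__ == "__main__":
    # Hypothetical 4-element gradient; only 2 of 4 entries are transmitted.
    residual = [0.0, 0.0, 0.0, 0.0]
    idx, vals, residual = compress_with_residual([0.1, -2.0, 0.5, 3.0], residual, k=2)
    print(idx, vals, residual)
```

In this sketch each worker would transmit only `idx` and `vals` (here 2 of 4 entries) instead of the dense gradient, while the unsent entries accumulate in `residual` and are added back in the next iteration. How such a scheme affects convergence is the kind of question the paper's theoretical and experimental comparisons address.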