OASR-WFBP: An overlapping aware start-up sharing gradient merging strategy for efficient communication in distributed deep learning

Yingjie Song, Zhuo Tang, Yaohua Wang, Xiong Xiao, Zhizhong Liu, Jing Xia, Kenli Li
DOI: https://doi.org/10.1016/j.jpdc.2024.104997
IF: 4.542
2024-10-19
Journal of Parallel and Distributed Computing
Abstract: Wait-Free-Back-Propagation (WFBP) is a practical method for distributed deep learning, but it suffers from high communication overhead. This overhead can be reduced by overlapping gradient communication with computation and by sharing the startup time among multiple gradient communication phases. However, existing optimizations share the startup time greedily and fail to exploit the overlap between computation and communication in a coordinated way. We propose an overlapping-aware startup-sharing Wait-Free-Back-Propagation (OASR-WFBP), in which an analytic model guides the sharing procedure. Evaluations show that OASR-WFBP improves iteration time by 5%-16% over the state-of-the-art WFBP algorithm.
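The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's analytic model. It assumes a simple linear cost per message (a fixed startup time `alpha` plus `beta` per unit of data), serialized communication, and made-up gradient ready times. The function name `finish_time` and all numbers are hypothetical, chosen only to show why merging everything greedily can lose overlap.

```python
def finish_time(ready, sizes, groups, alpha, beta):
    """Time at which the last gradient communication finishes.

    ready[i] : time at which gradient i becomes available during backprop
    sizes[i] : size of gradient i
    groups   : lists of gradient indices, each sent as one merged message
    alpha    : per-message startup time (assumed constant)
    beta     : per-unit transfer time (assumed constant)
    """
    t = 0.0
    for group in groups:
        # A merged message can start only when all of its gradients are
        # ready and the previous message has finished.
        start = max(t, max(ready[i] for i in group))
        # One startup cost is shared by every gradient in the group.
        t = start + alpha + beta * sum(sizes[i] for i in group)
    return t


# Hypothetical workload: the last gradient becomes ready much later.
ready = [1, 2, 3, 8]
sizes = [2, 2, 2, 2]
alpha, beta = 2.0, 0.5

no_merge = finish_time(ready, sizes, [[0], [1], [2], [3]], alpha, beta)
merge_all = finish_time(ready, sizes, [[0, 1, 2, 3]], alpha, beta)
overlap_aware = finish_time(ready, sizes, [[0, 1, 2], [3]], alpha, beta)
```

Under these assumed numbers, sending every gradient separately finishes at 13.0, merging everything into one message finishes at 14.0 because communication cannot start until the last gradient is ready, and merging only the first three gradients finishes at 11.0.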
computer science, theory & methods