Dynamic Distribution Strategy of Distributed Tasks Based on Limited Synchronous Parallel

Weibo Zhu, Meiting Xue, Peiran Xu, Jilin Zhang, Chao Sun
DOI: https://doi.org/10.21203/rs.3.rs-1768291/v1
2022-01-01
Abstract: In a distributed deep learning system based on mobile service computing, computing nodes may differ in performance, and that performance may be affected by the external environment, resulting in training interruptions or slow convergence. In this study, we therefore propose a strategy that dynamically adjusts the number of tasks assigned to each computing node, targeting scenarios in which changes in node performance slow the training of distributed deep learning systems. Based on this, we designed a distributed parallel computing strategy called weight-based load balancing (WLBS). The WLBS divides and adjusts the task volume of computing nodes through an initial division followed by partial adjustment, so that computing nodes with lagging performance can exert their maximum computing power while reducing their lag impact on overall system training. Experiments show that, in distributed deep learning model training, the WLBS effectively reduces model training time while preserving model accuracy.
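The abstract names two phases, an initial division and a partial adjustment, but gives no formulas. The following Python sketch illustrates one plausible reading of that idea and is not the authors' algorithm: the function names, the proportional split by per-node weights, the use of measured step times to identify the lagging node, and the `fraction` parameter are all assumptions introduced here for illustration.

```python
def initial_division(total_tasks, weights):
    """Assumed initial division: split tasks in proportion to node weights."""
    total_weight = sum(weights)
    raw = [total_tasks * w / total_weight for w in weights]
    shares = [int(r) for r in raw]
    # Hand out the tasks lost to rounding, largest fractional part first.
    leftover = total_tasks - sum(shares)
    order = sorted(range(len(weights)),
                   key=lambda i: raw[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def partial_adjustment(shares, step_times, fraction=0.25):
    """Assumed partial adjustment: shift part of the slowest node's load.

    step_times holds a hypothetical per-node measurement of how long the
    last training step took; fraction is an illustrative tuning knob.
    """
    nodes = range(len(shares))
    slow = max(nodes, key=lambda i: step_times[i])
    fast = min(nodes, key=lambda i: step_times[i])
    adjusted = list(shares)
    if slow == fast:
        return adjusted
    moved = int(shares[slow] * fraction)
    adjusted[slow] -= moved
    adjusted[fast] += moved
    return adjusted


if __name__ == "__main__":
    shares = initial_division(100, [1, 1, 2])
    print(shares)
    print(partial_adjustment(shares, [1.0, 1.0, 4.0]))
```

In this sketch the lagging node keeps most of its work rather than being dropped, which matches the abstract's stated goal of letting slower nodes still contribute while limiting how much they delay the rest of the system.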