A Load-Balancing Strategy Based on Multi-Task Learning in a Distributed Training Environment

Yinqi Huang, Wenjiang Zhou, Liang Li, Ting Lyu, Haitao Xu
DOI: https://doi.org/10.1109/aeeca59734.2023.00158
2023-01-01
Abstract: With the development of machine learning and big data technologies, distributed training has become an important way to improve computational efficiency. However, in a distributed training environment, performance differences between workers and interference from unrelated tasks can leave the system unevenly loaded and cause the "straggler phenomenon". Resolving the straggler phenomenon and balancing the system load is therefore a pressing problem in distributed training. To address the shortcomings of the existing parallel computing model DSSP, this paper proposes a load-balancing strategy for DDP that effectively reduces the load difference by adjusting the batch size and data size of workers during training. Then, to explore the correlation between the data size in DDP and the synchronization threshold in DSSP under dynamic adjustment, we perform multi-task learning over the two dynamic adjustment strategies and integrate the proposed Joint Multi-Task Prediction scheme into DSSP to implement a new parallel computing model, ESP. Extensive experiments show that ESP not only preserves model accuracy but also effectively improves training speed in distributed training.
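The idea of reducing load differences by adjusting per-worker batch sizes can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a simple rule in which each worker's share of a fixed global batch is proportional to its measured throughput (samples per second). The function name `rebalance_batch_sizes` and its parameters are hypothetical.

```python
from typing import List


def rebalance_batch_sizes(throughputs: List[float], global_batch_size: int) -> List[int]:
    """Split a global batch across workers in proportion to throughput.

    Illustrative assumption only: faster workers receive larger batches so
    that all workers finish an iteration at roughly the same time.
    """
    total = sum(throughputs)
    if total <= 0:
        raise ValueError("throughputs must sum to a positive value")

    # Ideal (fractional) share for each worker.
    raw = [global_batch_size * t / total for t in throughputs]
    # Round down first, then hand out the leftover samples.
    sizes = [int(r) for r in raw]
    remainder = global_batch_size - sum(sizes)

    # Largest-remainder method: workers with the biggest fractional part
    # receive one extra sample each until the global batch is fully assigned.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - sizes[i], reverse=True)
    for i in order[:remainder]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    # One fast worker and two slower ones sharing a global batch of 256.
    print(rebalance_batch_sizes([100.0, 50.0, 50.0], 256))
```

Keeping the global batch size fixed while shifting samples toward faster workers is one simple way to shorten the wait on stragglers without changing the effective batch seen by the optimizer; the paper's strategy additionally adjusts each worker's data size.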