Two-Stage Coded Distributed Learning: A Dynamic Partial Gradient Coding Perspective

Xinghan Wang, Xiaoxiong Zhong, Jiahong Ning, Tingting Yang, Yuanyuan Yang, Guoming Tang, Fangming Liu
DOI: https://doi.org/10.1109/icdcs57875.2023.00020
2022-01-01
Abstract: Distributed learning has been widely adopted to train a global model from local data. However, its performance can be severely degraded by stragglers. Recent research addresses the straggler problem with gradient coding, which in essence tolerates stragglers by adding data redundancy. However, the resulting data redundancy, together with the computation and communication overhead it brings, remains difficult to manage. In addition, the complexity of encoding and decoding increases linearly with the number of local workers. To this end, we design a lightweight coding method for the computing phase and seek to ensure fair transmission in the communication phase. Specifically, to tolerate stragglers in the computing phase, we propose a two-stage dynamic coding scheme: a subset of the workers start computing partial gradients from their assigned data partitions in the first stage, and the remaining workers that compute in the second stage are determined by which workers have finished the first stage. To further tolerate stragglers in the communication phase, we design a perturbed Lyapunov function that maximizes admission data balancing fairness as well as throughput. Experimental results verify the derived properties and demonstrate that the proposed solution achieves better accuracy and resource utilization in a distributed learning system under practical network parameters and benchmark data.
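The idea that gradient coding tolerates stragglers through data redundancy can be made concrete with the standard three-worker illustration from the gradient coding literature. The sketch below is not the two-stage scheme proposed in this paper. The gradient values, the `combo` helper, and the `encode`/`decode` tables are assumptions chosen for illustration. Each worker holds two of three data partitions and sends one coded vector, and the full gradient g1 + g2 + g3 can be recovered from any two workers, so one straggler is tolerated.

```python
from itertools import combinations

# Hypothetical partial gradients for three data partitions (toy 2-D vectors).
g1, g2, g3 = [1.0, 2.0], [3.0, -1.0], [0.5, 4.0]
grads = [g1, g2, g3]


def combo(coeffs, vectors):
    """Return the linear combination sum(c * v) of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(c * v[i] for c, v in zip(coeffs, vectors)) for i in range(dim)]


# Encoding coefficients over (g1, g2, g3): each worker touches two partitions.
encode = {
    0: [0.5, 1.0, 0.0],   # worker 0 sends g1/2 + g2
    1: [0.0, 1.0, -1.0],  # worker 1 sends g2 - g3
    2: [0.5, 0.0, 1.0],   # worker 2 sends g1/2 + g3
}

# Decoding weights applied to the coded vectors of each surviving pair.
decode = {
    (0, 1): [2.0, -1.0],
    (0, 2): [1.0, 1.0],
    (1, 2): [1.0, 2.0],
}

sent = {w: combo(c, grads) for w, c in encode.items()}
full = combo([1.0, 1.0, 1.0], grads)  # the uncoded sum g1 + g2 + g3

# Simulate every single-straggler case: only two workers report back.
recovered = {}
for pair in combinations(range(3), 2):
    recovered[pair] = combo(decode[pair], [sent[w] for w in pair])
    print("workers", pair, "recover", recovered[pair], "target", full)
```

The cost of this redundancy is visible in the encoding table: every partition is processed by two workers, which is the computation overhead the abstract refers to.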