Coded Parallelism for Distributed Deep Learning.

Songting Ji, Zhaoyang Zhang, Zhaohui Yang, Richeng Jin, Qianqian Yang
DOI: https://doi.org/10.1109/isit54713.2023.10206933
2023-01-01
Abstract: With the rapid development of deep learning, modern neural network models, especially in the field of Natural Language Processing (NLP), have grown to an extremely large number of parameters. When a model's parameters exceed the memory of a single device, the model must be split into parts, with each part assigned to one device, so that training is carried out jointly across devices (i.e., distributed training). In this paper, we introduce an advanced coding scheme into the distributed parallel framework, tightly integrating the coding with the underlying computations of neural networks. The proposed scheme not only mitigates the impact on system performance of devices with poor computing power or low bandwidth, and even of dropped devices (stragglers), but also reduces the communication load between devices, thereby improving the performance of distributed parallel systems.
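The abstract does not spell out the coding scheme itself, so the following is only a generic illustration of how coded computation can tolerate a straggler, not the authors' method. It is a minimal sketch assuming a textbook (3, 2) erasure code applied to a matrix-vector product: the matrix is split into two row blocks, a third "parity" block is their sum, and the results of any two of the three workers suffice to recover the full product. All function names (`matvec`, `encode`, `decode`) are hypothetical, and the sketch assumes an even number of rows.

```python
# Generic (3, 2) coded matrix-vector product; illustrative only,
# not the scheme proposed in the paper.

def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]

def encode(a):
    """Split a into two row blocks and add a parity block (their sum).

    Assumes len(a) is even so that both blocks have the same shape.
    """
    half = len(a) // 2
    a1, a2 = a[:half], a[half:]
    parity = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return [a1, a2, parity]

def decode(results):
    """Recover a @ x from the partial products of any two workers.

    results maps worker index (0, 1, or 2) to that worker's output.
    """
    if len(results) < 2:
        raise ValueError("need results from at least two workers")
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:
        y1 = results[0]
        # parity output is y1 + y2, so subtract the known half
        y2 = [s - u for s, u in zip(results[2], y1)]
    else:
        y2 = results[1]
        y1 = [s - u for s, u in zip(results[2], y2)]
    return y1 + y2

a = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]

shards = encode(a)
partials = {i: matvec(shard, x) for i, shard in enumerate(shards)}
del partials[1]  # pretend worker 1 is a straggler and never replies
recovered = decode(partials)
```

Here `recovered` equals the uncoded product `matvec(a, x)` even though one worker's result is missing. The price is one extra worker's worth of computation, which is the basic trade-off that coded distributed computing schemes exploit.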