Modeling the Training Iteration Time for Heterogeneous Distributed Deep Learning Systems.

Yifu Zeng, Bowei Chen, Pulin Pan, Kenli Li, Guo Chen
DOI: https://doi.org/10.1155/2023/2663115
IF: 8.993
2023-01-01
International Journal of Intelligent Systems
Abstract: Distributed deep learning systems have emerged in recent years to meet the growing demand for large-scale data processing. However, building such systems with powerful computing nodes requires significant investment, which places a heavy financial burden on developers and researchers. It would be valuable to predict the benefit precisely, i.e., the speedup over training on a single machine (or a few machines), before actually building such a large system. To address this problem, this paper presents a novel performance model of training iteration time for heterogeneous distributed deep learning systems, based on the characteristics of parameter server (PS) systems with bulk synchronous parallel (BSP) synchronization. The model's accuracy is demonstrated by comparing its predictions with real measurements on TensorFlow when training different neural networks on various hardware testbeds: the prediction accuracy is higher than 90% in most cases.
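To make the quantity being predicted concrete, here is a minimal sketch of how BSP synchronization shapes iteration time on heterogeneous workers. It is not the paper's model, whose terms the abstract does not give. It relies only on the general property that under BSP every worker waits at a barrier, so the slowest worker sets the pace. The `Worker` fields, the function names, and the equal per-worker batch size are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Worker:
    """Hypothetical per-iteration costs of one worker, in seconds."""
    compute_s: float  # forward + backward pass on this worker's hardware
    comm_s: float     # pushing gradients to and pulling parameters from the PS


def bsp_iteration_time(workers: Sequence[Worker]) -> float:
    # Under BSP all workers synchronize at a barrier each iteration,
    # so the iteration lasts as long as the slowest worker needs.
    return max(w.compute_s + w.comm_s for w in workers)


def throughput_speedup(single_machine_iter_s: float,
                       workers: Sequence[Worker]) -> float:
    # Assumption: every worker processes the same batch size as the
    # single-machine baseline, so n workers process n batches per iteration.
    return len(workers) * single_machine_iter_s / bsp_iteration_time(workers)


if __name__ == "__main__":
    # Three workers with different (made-up) speeds.
    cluster = [Worker(0.10, 0.02), Worker(0.20, 0.05), Worker(0.15, 0.02)]
    print(f"iteration time: {bsp_iteration_time(cluster):.3f} s")
    print(f"speedup vs. one fast machine: {throughput_speedup(0.10, cluster):.2f}x")
```

In this toy cluster the slowest worker dominates, so three machines yield far less than a 3x speedup. That gap between node count and realized speedup is what a performance model would need to estimate before hardware is purchased.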