MP-DPS: Adaptive Distributed Training for Deep Learning Based on Node Merging and Path Prediction

Yan Zeng, Yong Ding, Dongyang Ou, Jilin Zhang, Yongjian Ren, Yunquan Zhang
DOI: https://doi.org/10.1007/s42514-022-00098-9
2022-01-01
CCF Transactions on High Performance Computing
Abstract: With the increasing scale of data sets and neural network models, distributed training of deep neural networks has become a trend. Mainstream distributed parallel techniques rely on expert experience; they are inefficient and hard to optimize because they require extensive domain knowledge. Some researchers have proposed auto-parallel techniques for distributed model training, but these focus on specific models and specific parallel optimization factors. Such methods optimize only a single performance factor, are complex, and have low efficiency. In this paper, we propose an adaptive distributed parallel training method (MP-DPS), based on heterogeneous-computing-power-aware node merging and path prediction, to automatically search for the optimal parallel strategy in large-scale networks. First, we construct a multidimensional performance cost model to guide the design and implementation of the distributed parallel strategy. Second, we propose a node merging method with heterogeneous computing power awareness to reduce the search space and improve search efficiency. Finally, we propose a graph search algorithm based on path prediction, which finds the optimal distributed parallel strategy by optimizing the critical path execution time, based on predicting the optimal placement of the critical operator nodes on the path. Experiments show that deep learning models (such as ResNet and NasNet) can be trained effectively on 4 and 8 GPUs (P100) with the distributed parallel strategy found by MP-DPS, and that the search time for the optimal distributed parallel strategy is reduced compared with the FastT method.
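To make the notion of critical path execution time concrete, the following is a minimal Python sketch, not the MP-DPS algorithm itself. It assumes a toy cost model in which each operator has a fixed compute time and every edge crossing devices pays a constant transfer time. The function name `critical_path` and the parameters `compute`, `edges`, `placement`, and `comm_cost` are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(compute, edges, placement, comm_cost):
    """Return (length, path) of the longest path through an operator DAG.

    compute:   {op: compute time on its assigned device}      (assumed input)
    edges:     list of (src, dst) data dependencies           (assumed input)
    placement: {op: device id}                                (assumed input)
    comm_cost: transfer time charged when src and dst are on different devices
    """
    preds = {op: [] for op in compute}
    for src, dst in edges:
        preds[dst].append(src)

    finish = {}  # earliest finish time of each operator
    parent = {}  # predecessor that determines each operator's start time
    for op in TopologicalSorter(preds).static_order():
        start, best = 0.0, None
        for p in preds[op]:
            transfer = comm_cost if placement[p] != placement[op] else 0.0
            if finish[p] + transfer > start:
                start, best = finish[p] + transfer, p
        finish[op] = start + compute[op]
        parent[op] = best

    # Walk back from the operator that finishes last to recover the path.
    node = max(finish, key=finish.get)
    length, path = finish[node], []
    while node is not None:
        path.append(node)
        node = parent[node]
    return length, path[::-1]


# Toy diamond-shaped graph: a feeds b and c, which both feed d.
compute = {"a": 1, "b": 2, "c": 3, "d": 1}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
split = {"a": 0, "b": 0, "c": 1, "d": 0}    # c placed on a second device
single = {"a": 0, "b": 0, "c": 0, "d": 0}   # everything on one device

print(critical_path(compute, edges, split, comm_cost=0.5))
print(critical_path(compute, edges, single, comm_cost=0.5))
```

In this toy setting, moving operator `c` to another device lengthens the critical path because of the two extra transfers. A placement search of the kind the abstract describes would weigh such communication costs against the gains from running operators in parallel.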