PerfTop: Towards Performance Prediction of Distributed Learning over General Topology

Changzhi Yan,Zehan Zhu,Youcheng Niu,Cong Wang,Cheng Zhuo,Jinming Xu
DOI: https://doi.org/10.1016/j.jpdc.2024.104922
IF: 4.542
2024-05-26
Journal of Parallel and Distributed Computing
Abstract:Distributed learning with multiple GPUs has been widely adopted to accelerate the training process of large-scale deep neural networks. However, misconfiguration of the GPU clusters with various communication primitives and topologies could potentially diminish the gains in parallel computation and lead to significant degradation in training efficiency. Predicting the performance of distributed learning enables service providers to identify potential bottlenecks beforehand. In this work, we propose a Perf ormance prediction framework over General Top ologies, called PerfTop, for accurate estimation of per-iteration execution time. The main strategy is to integrate computation time prediction with an analytical model to map the nonlinearity in communication and fine-grained computation-communication patterns. This enables accurate prediction of a variety of neural network models over general topologies, such as tree, hierarchical, and exponential. Our extensive experiments show that PerfTop outperforms existing methods in estimating both computation and communication time, particularly for communication, surpassing the existing methods by over 45%. Meanwhile, it achieves an accuracy of above 85% in predicting the execution time over general topologies compared to simple topologies such as star and ring from the previous works.
computer science, theory & methods
What problem does this paper attempt to address?