A Survey on Performance Modeling and Prediction for Distributed DNN Training

Zhenhua Guo, Yinan Tang, Jidong Zhai, Tongtong Yuan, Jian Jin, Li Wang, Yaqian Zhao, Rengang Li
DOI: https://doi.org/10.1109/tpds.2024.3476390
IF: 5.3
2024-10-30
IEEE Transactions on Parallel and Distributed Systems
Abstract: Recent breakthroughs in large-scale DNNs have attracted significant attention from both academia and industry to distributed DNN training techniques. Because large-scale distributed DNN training is time-consuming and expensive to run, it is crucial to model and predict its performance before actual deployment, so that the training design can be optimized at low cost. This paper analyzes and emphasizes the importance of modeling and predicting the performance of distributed DNN training, categorizes and analyzes the related state-of-the-art work, and discusses future challenges and opportunities for this research field. The objectives of this paper are twofold: first, to assist researchers in understanding and choosing suitable modeling and prediction tools for large-scale distributed DNN training, and second, to encourage researchers to propose more valuable research on performance modeling and prediction for distributed DNN training in the future.
computer science, theory & methods; engineering, electrical & electronic