A Topology-Aware Performance Prediction Model for Distributed Deep Learning on GPU Clusters

Zheyu Lin, Xukun Chen, Hanyu Zhao, Yunteng Luan, Zhi Yang, Yafei Dai
DOI: https://doi.org/10.1109/bigdata50022.2020.9378252
2020-01-01
Abstract: Multi-GPU training has become common practice for deep learning workloads. The performance of a training job can be affected significantly by both the GPU connectivity in the system topology and the computation-communication pattern of the job. Cluster schedulers therefore need to be aware of jobs’ performance characteristics to improve both job and cluster efficiency. In this paper, we propose an online resource-performance model for deep learning training jobs on GPU clusters. The model estimates the training speed of a specific job as a function of any given resource setting (i.e., the number and locality of GPUs). It is based on systematic modeling of the system topology and the communication patterns of individual jobs, with online fitting on a sample set of profiled performance data. Experiments show that our performance model achieves 94% prediction accuracy on average (up to 99.9%). Additionally, a large-scale simulation on a real production trace demonstrates that our model helps a typical scheduling algorithm decrease average job completion time by 3.4x and makespan by 1.7x.
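The abstract does not give the model's functional form, so the following is only a minimal sketch of what fitting a resource-performance model on profiled samples could look like, not the authors' method. It assumes per-iteration time is a linear combination of a compute term (1/n), a ring all-reduce communication term scaled by a locality-dependent bandwidth, and a constant overhead. The locality classes, the `BANDWIDTH` values, the function names, and the profiled samples are all hypothetical; the samples are synthetic and generated from known coefficients purely to exercise the fit.

```python
# Hypothetical relative link bandwidths per GPU locality class (illustrative values,
# not taken from the paper).
BANDWIDTH = {"same_socket": 4.0, "cross_socket": 2.0, "cross_node": 1.0}

def features(n_gpus, locality):
    """Feature vector [compute term, communication term, constant overhead]."""
    compute = 1.0 / n_gpus
    # Ring all-reduce moves 2(n-1)/n of the gradient volume per GPU.
    comm = 0.0 if n_gpus == 1 else 2.0 * (n_gpus - 1) / n_gpus / BANDWIDTH[locality]
    return [compute, comm, 1.0]

def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(samples):
    """Least-squares fit (normal equations) from (n_gpus, locality, seconds) samples."""
    rows = [features(n, loc) for n, loc, _ in samples]
    times = [t for _, _, t in samples]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
    return solve(xtx, xty)

def predict_speed(coef, n_gpus, locality):
    """Predicted training speed in iterations per second for a resource setting."""
    seconds = sum(c * f for c, f in zip(coef, features(n_gpus, locality)))
    return 1.0 / seconds

# Synthetic "profiled" samples generated from known coefficients (not real measurements).
TRUE_COEF = [0.8, 0.1, 0.02]
configs = [(1, "same_socket"), (2, "same_socket"), (2, "cross_socket"),
           (4, "cross_node"), (8, "cross_node")]
samples = [(n, loc, sum(c * f for c, f in zip(TRUE_COEF, features(n, loc))))
           for n, loc in configs]

coef = fit(samples)
print(predict_speed(coef, 4, "same_socket"))  # speed for an unprofiled setting
```

In an online setting, `fit` would be rerun as new profiled samples arrive. This sketch refits on the full sample set each time, which is one simple choice among several.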