Machine Learning Based Performance Analysis and Prediction of Jobs on a HPC Cluster

Zhengxiong Hou, Shuxin Zhao, Chao Yin, Yunlan Wang, Jianhua Gu, Xingshe Zhou
DOI: https://doi.org/10.1109/pdcat46702.2019.00053
2019-01-01
Abstract: Many universities and research institutes operate small or medium-sized high-performance computing (HPC) clusters, and years of operation have produced large volumes of job logs. In this paper, we examine and analyze the job logs accumulated on one such cluster. We then study machine learning based methods for analyzing and predicting the performance of parallel jobs. Several machine learning methods, including multivariate linear fitting and artificial neural networks, are used to build performance prediction models. We compare the errors of the models and select the optimal prediction model for different users. The experimental results show that the selected machine learning algorithms achieve reasonable prediction accuracy.
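The per-user model selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The feature set (node count, problem size), the toy job data, the hold-out split, and the error metric (mean absolute percentage error) are all assumptions made for illustration. A constant-mean baseline stands in for the artificial neural network to keep the example dependency-free.

```python
from statistics import mean

def fit_linear(X, y):
    """Multivariate least-squares fit y ~ w0 + w.x, solved via the normal equations."""
    rows = [[1.0] + list(x) for x in X]          # prepend the intercept column
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    for col in range(n):                         # Gauss-Jordan with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    w = [M[i][n] / M[i][i] for i in range(n)]
    return lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def fit_mean(X, y):
    """Baseline: always predict the mean training runtime (stand-in for an ANN)."""
    m = mean(y)
    return lambda x: m

def mape(model, X, y):
    """Mean absolute percentage error of a fitted model on held-out jobs."""
    return mean(abs(model(x) - t) / t for x, t in zip(X, y))

def select_model(jobs_by_user, candidates, n_test=2):
    """For each user, fit every candidate and keep the one with the lowest test error."""
    best = {}
    for user, (X, y) in jobs_by_user.items():
        X_tr, y_tr = X[:-n_test], y[:-n_test]    # older jobs for training
        X_te, y_te = X[-n_test:], y[-n_test:]    # most recent jobs for testing
        errors = {name: mape(fit(X_tr, y_tr), X_te, y_te)
                  for name, fit in candidates.items()}
        best[user] = min(errors, key=errors.get)
    return best

# Hypothetical job logs: features are (nodes, problem_size), targets are runtimes.
jobs_by_user = {
    "alice": ([(1, 1), (2, 1), (1, 2), (2, 2), (3, 2)], [15, 16, 19, 20, 21]),
    "bob":   ([(1, 1), (2, 1), (1, 2), (4, 1), (1, 4)], [100, 110, 90, 100, 100]),
}
candidates = {"linear": fit_linear, "mean": fit_mean}

best = select_model(jobs_by_user, candidates)
print(best)
```

On this toy data the linear fit is selected for the first user, whose runtimes follow the features linearly, and the mean baseline for the second, whose held-out runtimes do not. This shows why a single global model may not suit every user.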