Exploring Job Running Path to Predict Runtime on Multiple Production Supercomputers

Wenxiang Yang, Xiangke Liao, Dezun Dong, Jie Yu
DOI: https://doi.org/10.1016/j.jpdc.2023.01.001
IF: 4.542
2023-01-01
Journal of Parallel and Distributed Computing
Abstract: Massive numbers of jobs are submitted to a supercomputer, and a job management system is typically deployed to schedule these jobs and allocate compute resources. FCFS (First Come, First Served) is a popular scheduling policy in which a job's priority is determined by its arrival time. However, under FCFS, if the idle resources cannot meet the requirements of the head job in the waiting queue, they cannot be allocated to other jobs, which wastes resources. To improve resource utilization, backfilling was proposed: it allocates the reserved idle compute nodes to a small, short-running non-head job, provided that doing so does not delay the original head job. Backfilling requires knowing a job's runtime in advance, and the traditional approach relies on the user's estimate. Unfortunately, users generally overestimate the runtime. Many studies extract features from historical job logs and apply machine learning to predict the runtime, but traditional features are insufficient to characterize a job. In this paper, we collect job logs from two supercomputers and present a novel runtime prediction framework called PREP. PREP explores the job's running path as a new feature, which carries rich information about the job's properties, such as the user, the project, the scale of the data sets, and the parameters used. Because there is a strong correlation between a job's runtime and its running path, we group jobs with similar paths into clusters and train a separate runtime prediction model for each cluster. Extensive evaluations demonstrate that introducing the new feature achieves higher prediction accuracy (88.5% and 82.3% on the two production supercomputers, respectively), and that our framework outperforms other popular strategies such as Last-2 and IRPA. In addition, the predicted runtime is inserted into the real job trace of a Slurm simulator to verify the advantages of PREP.
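The idea of grouping jobs by running path and predicting per group can be illustrated with a minimal sketch. This is not PREP's implementation: the abstract does not say how path similarity is measured or which model is trained per cluster. The sketch assumes that jobs sharing the first few directory components of their path belong to one cluster, and it uses the cluster's mean historical runtime as a stand-in for a trained model. The names `path_key`, `fit_cluster_means`, and `predict`, as well as the sample paths and runtimes, are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from statistics import mean


def path_key(running_path, depth=3):
    """Hypothetical cluster key: the first `depth` directory names of the path."""
    parts = PurePosixPath(running_path).parts
    # Drop the leading "/" so the key contains directory names only.
    names = [p for p in parts if p != "/"]
    return tuple(names[:depth])


def fit_cluster_means(history, depth=3):
    """Build one stand-in "model" per cluster: the mean runtime of its jobs.

    history: iterable of (running_path, runtime_seconds) pairs.
    """
    buckets = defaultdict(list)
    for path, runtime in history:
        buckets[path_key(path, depth)].append(runtime)
    return {key: mean(values) for key, values in buckets.items()}


def predict(models, running_path, fallback, depth=3):
    """Return the cluster's prediction, or `fallback` for an unseen cluster."""
    return models.get(path_key(running_path, depth), fallback)


# Made-up job history for illustration only.
history = [
    ("/home/alice/cfd/run1", 3600),
    ("/home/alice/cfd/run2", 3800),
    ("/home/bob/md/test", 600),
]
models = fit_cluster_means(history)
print(predict(models, "/home/alice/cfd/run3", fallback=7200))
```

In this sketch the `fallback` argument plays the role of the user-supplied estimate: a job whose path matches no known cluster simply keeps the value it would have had without prediction.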