Predicting accurate batch queue wait times on production supercomputers by combining machine learning techniques
Nick Brown, Gordon Gibb, Evgenij Belikov, Rupert Nash
DOI: https://doi.org/10.1002/cpe.8112
2024-04-12
Concurrency and Computation: Practice and Experience
Abstract: The ability to accurately predict when a job on a supercomputer will leave the queue and start to run is not only beneficial for providing insights to users, but can also help enable non‐traditional HPC workloads that are not well suited to the batch queue approach that is ubiquitous on production HPC machines. However, there are numerous challenges in achieving such a prediction with high accuracy, not least because the queue's state can change rapidly and depends upon many factors. In this work, we explore a novel machine learning approach for predicting queue wait times, hypothesising that such a model can capture the complex behavior resulting from the queue policy and other interactions to generate accurate job start times. For ARCHER2 (HPE Cray EX), Cirrus (HPE 8600), and 4‐cabinet (HPE Cray EX), we explore how different machine learning approaches and techniques improve the accuracy of our predictions, comparing against the estimates generated by Slurm. By combining categorization and regression models, we demonstrate that our approach delivers the most accurate predictions across our machines of interest, predicting job start times within 1 min of the actual start time for around 65% of jobs on ARCHER2 and 4‐cabinet, and 76% of jobs on Cirrus. Compared against what Slurm can deliver via the backfill plugin, this represents around 3.8 times better accuracy on ARCHER2 and 18 times better on Cirrus. Furthermore, our approach can accurately predict the start time to within 10 min of the actual start time for three quarters of all jobs on ARCHER2 and 4‐cabinet, and for 90% of jobs on Cirrus. Whilst the initial driver of this work was to better facilitate non‐traditional (interactive and urgent) workloads on HPC machines, the insights gained can also be used to provide wider benefits to users, enrich existing batch queue systems, and inform supercomputing center policy.
computer science, theory & methods, software engineering
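
Illustrative sketch (not taken from the paper): the abstract reports accuracy as the share of jobs whose predicted start time falls within a tolerance (1 min or 10 min) of the actual start time, and says that categorization and regression models are combined, without stating how. The Python below shows one plausible arrangement, in which a classifier picks a wait-time category and a regressor for that category produces the estimate, together with the within-tolerance metric. The function names, the feature names (nodes_requested, jobs_ahead), the toy models, and the numbers are all assumptions made for illustration; none of them are results or code from the paper.

```python
from typing import Callable, Dict, Sequence

Features = Dict[str, float]


def combined_predict(
    features: Features,
    classify: Callable[[Features], str],
    regressors: Dict[str, Callable[[Features], float]],
) -> float:
    """Pick a wait-time category, then let that category's regressor estimate the wait (seconds)."""
    category = classify(features)
    return regressors[category](features)


def fraction_within(
    predicted: Sequence[float], actual: Sequence[float], tolerance_s: float
) -> float:
    """Share of jobs whose predicted wait is within tolerance_s seconds of the actual wait."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one job is required")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance_s)
    return hits / len(actual)


# Hypothetical stand-ins for trained models; real models would be learned from queue data.
def toy_classify(f: Features) -> str:
    return "short" if f["nodes_requested"] <= 4 else "long"


toy_regressors = {
    "short": lambda f: 30.0 * f["jobs_ahead"],
    "long": lambda f: 600.0 * f["jobs_ahead"],
}

# Made-up jobs and made-up observed waits, purely to exercise the functions.
jobs = [
    {"nodes_requested": 1, "jobs_ahead": 2},
    {"nodes_requested": 2, "jobs_ahead": 0},
    {"nodes_requested": 64, "jobs_ahead": 3},
    {"nodes_requested": 128, "jobs_ahead": 1},
]
actual_waits_s = [45.0, 20.0, 2100.0, 630.0]

predicted_waits_s = [combined_predict(j, toy_classify, toy_regressors) for j in jobs]
within_1_min = fraction_within(predicted_waits_s, actual_waits_s, 60.0)
within_10_min = fraction_within(predicted_waits_s, actual_waits_s, 600.0)

print(f"within 1 min: {within_1_min:.0%}, within 10 min: {within_10_min:.0%}")
```

Separating the category step from the regression step means each regressor only has to fit jobs with broadly similar waits; whether this matches the authors' design is not stated in the abstract.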