Deep Reinforcement Agent for Failure-aware Job Scheduling in High-Performance Computing.

Kang Yang, Rongyu Cao, Yueyuan Zhou, Jiawei Zhang, En Shao, Guangming Tan
DOI: https://doi.org/10.1109/icpads53394.2021.00061
2021-01-01
Abstract: Job scheduling is crucial in high-performance computing (HPC): it decides which jobs are admitted to the system and when, and on which resources they are placed, while balancing multiple scheduling goals. With the growth of diverse resources and the surge of deep learning training (DLT) workloads, job failure has become a common issue in HPC, which will affect user sa...