TIAS: Two-level Information-Agnostic Job Scheduling in GPU Clusters

Kun Yang, Jieyu Lin, Wei Ni, Liang Song
DOI: https://doi.org/10.1109/insai54028.2021.00041
2021-01-01
Abstract: In recent years, deep learning has trended towards larger models and larger datasets. Centralized training cannot keep up with these training requirements because of limited storage and computing resources, so distributed learning is becoming an important area of research for improving learning efficiency. Many studies use the features of deep learning workloads to design a central scheduler for production clusters. While existing work has focused on overall completion time and resource efficiency, little attention has been paid to execution deadlines. To balance the goals of deadline and non-deadline jobs, we design a Two-level Information-Agnostic Scheduling strategy (TIAS), which can schedule the two kinds of jobs together without knowing the jobs’ training duration. In the first level, we use different priority calculation methods for the two kinds of jobs; in the second level, we design a new indicator, "queue urgency", based on three observations, to sort deadline jobs within the same queue. Experiments on a trace-driven simulator show that TIAS achieves the best trade-off between deadline miss rate and non-deadline jobs’ average job completion time (JCT) compared to existing solutions.