A QoS-oriented Scheduling and Autoscaling Framework for Deep Learning

Sikai Xing, Shiyou Qian, Bin Cheng, Jian Cao, Guangtao Xue, Jiadi Yu, Yanmin Zhu, Minglu Li
DOI: https://doi.org/10.1109/ijcnn.2019.8852319
2019-01-01
Abstract: Deep learning is popular in many areas, but users must manually specify the resource configuration when submitting deep learning training jobs, and they usually over-provision resources. Such poorly matched configurations result in slow training and low resource utilization. It would be more convenient and efficient if users only needed to specify the quality of service (QoS) for their jobs and the resources were then configured automatically to meet it. To satisfy this demand, we present a QoS-oriented scheduling and autoscaling framework that schedules and autoscales deep learning training jobs in a Kubernetes cluster. This paper focuses on the most important QoS requirement for deep learning training jobs: the deadline. The goal of the framework is to guarantee that as many jobs as possible are completed before their specified deadlines. To reach this goal, the framework schedules deep learning jobs with a heuristic scheduling policy based on resource status and job deadline, and autoscales resource configuration by exploiting a characteristic of deep learning jobs: the predictability of training time. This predictability is used to predict whether a job can be completed before its deadline and, if necessary, to estimate an appropriate resource configuration. We implemented the framework by modifying the default scheduler of Kubernetes and conducted experiments to evaluate its performance. The experimental results show that our scheduling policy improves the completion rate by 26% when cluster resources are insufficient, and that our autoscaling policy raises the completion rate to 100% when cluster resources are sufficient. We also show that the framework raises the utilization of allocated CPUs to 100%. Our proposed framework points to a new way of submitting and managing deep learning training jobs in a cluster.
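The deadline check and resource estimate described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the prediction model or the heuristic, so the sketch assumes that throughput scales linearly with the number of CPUs and substitutes a plain earliest-deadline-first ordering for the paper's scheduling policy. All names (`Job`, `predicted_finish_s`, `min_cpus_for_deadline`, `schedule_order`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    remaining_steps: int          # training steps still to run
    steps_per_sec_per_cpu: float  # measured throughput per CPU (assumption: scales linearly)
    cpus: int                     # CPUs currently allocated
    deadline_s: float             # time left until the deadline, in seconds


def predicted_finish_s(job, cpus=None):
    """Predicted remaining training time under the linear-scaling assumption."""
    cpus = job.cpus if cpus is None else cpus
    return job.remaining_steps / (job.steps_per_sec_per_cpu * cpus)


def meets_deadline(job):
    """True if the job is predicted to finish before its deadline."""
    return predicted_finish_s(job) <= job.deadline_s


def min_cpus_for_deadline(job, max_cpus):
    """Smallest CPU count predicted to meet the deadline, or None if
    even max_cpus is not enough."""
    needed = math.ceil(
        job.remaining_steps / (job.steps_per_sec_per_cpu * job.deadline_s)
    )
    needed = max(needed, 1)
    return needed if needed <= max_cpus else None


def schedule_order(jobs):
    """Earliest-deadline-first ordering, a stand-in for the paper's heuristic."""
    return sorted(jobs, key=lambda j: j.deadline_s)


if __name__ == "__main__":
    job = Job("resnet", remaining_steps=10_000, steps_per_sec_per_cpu=2.0,
              cpus=2, deadline_s=2_000.0)
    print(meets_deadline(job))                      # is the current allocation enough?
    print(min_cpus_for_deadline(job, max_cpus=8))   # scale-up target, if any
```

In this sketch, a job that fails `meets_deadline` would be scaled up to the value returned by `min_cpus_for_deadline`; a return value of `None` corresponds to the case where the cluster cannot meet the deadline at all.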