Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters

Zhaoyun Chen, Lei Luo, Wei Quan, Mei Wen, Chunyuan Zhang
DOI: https://doi.org/10.1109/infcomw.2019.8845276
2019-01-01
Abstract: With the recent widespread adoption of deep learning (DL) in academia and industry, DL platforms that support the research and development (R&D) of AI firms, institutes and universities are attracting more attention. For off-the-shelf distributed GPU clusters, prior work proposes prediction-based schedulers to allocate resources for diverse DL workloads. However, prediction-based schedulers have drawbacks in prediction accuracy and offline-profiling costs. In this paper, we propose a learning-based scheduler that models the scheduling problem as a reinforcement learning problem, targeting minimum average job completion time and maximum system utilization. The scheduler's design comprises the state space, action space, reward function and update scheme. Furthermore, we will evaluate the proposed scheduler, implemented as a plugin of TensorFlow, on a real cluster and in large-scale simulation.
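The four components named in the abstract (state space, action space, reward function, update scheme) can be illustrated with a minimal sketch. The abstract does not specify the learning algorithm, the state features, or the reward weights, so everything below is an assumption made for illustration: tabular Q-learning stands in for the update scheme, and the names `encode_state`, `reward`, `q_update` and `ACTIONS` are hypothetical rather than taken from the paper.

```python
from collections import defaultdict

# Hypothetical action space: number of GPUs granted to the next queued job.
ACTIONS = (1, 2, 4)


def encode_state(free_gpus, queue_len):
    """Assumed state: free GPU count plus a queue length capped at 3."""
    return (free_gpus, min(queue_len, 3))


def reward(completion_time, used_gpus, total_gpus):
    """Assumed reward: penalize completion time, reward cluster utilization."""
    return -completion_time + used_gpus / total_gpus


def q_update(q, state, action, r, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step (a stand-in for the paper's update scheme)."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# A single illustrative transition with made-up numbers.
q = defaultdict(float)
s = encode_state(free_gpus=4, queue_len=5)
s_next = encode_state(free_gpus=2, queue_len=4)
r = reward(completion_time=3.0, used_gpus=2, total_gpus=4)
new_value = q_update(q, s, 2, r, s_next)
```

A lookup table keeps the example small. A real cluster scheduler would likely need a richer state representation and a function approximator, which this sketch does not attempt.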