Design and Performance Modeling of A YARN-based GPU Resource Scheduling System

Jianbo Huang, Wenli Zhou, Ruoning Song, Fang Liu, Siye Wang, Jun Liu
DOI: https://doi.org/10.1109/ICCC47050.2019.9064049
2019-01-01
Abstract: The training of Deep Learning (DL) models can be accelerated significantly with Graphics Processing Units (GPUs). The most common way of using GPUs is to access a server with GPUs attached and train the model directly on it, but this approach can cause unbalanced load and low GPU utilization when multiple applications share multiple servers. This paper gives a mathematical definition of this problem and presents a YARN-based solution. The key to the solution is a resource scheduling platform with a customized Application Master (AM) optimized for DL. The paper also presents a performance model of the platform, along with a proof that it achieves higher GPU utilization. A benchmark of training ResNet on ImageNet shows little performance degradation on the platform.