E-LAS: Design and Analysis of Completion-Time Agnostic Scheduling for Distributed Deep Learning Cluster

Abeda Sultana, Li Chen, Fei Xu, Xu Yuan
DOI: https://doi.org/10.1145/3404397.3404415
2020-01-01
Abstract: With the rise of deep learning, enterprises and large platform providers such as Microsoft, Amazon, and Google have built and provided GPU clusters to facilitate distributed deep learning training. As deep learning training workloads are heterogeneous, with a diverse range of characteristics and resource requirements, it becomes increasingly crucial to design an efficient scheduler for distributed deep learning jobs in the GPU cluster. This paper proposes a simple yet effective scheduler, called E-LAS, with the objective of reducing the average training completion time of deep learning jobs. Without relying on estimation or prior knowledge of the job running time, E-LAS leverages the real-time epoch progress rate, which is unique to distributed deep learning training jobs, as well as the attained services from the temporal and spatial domains, to guide its scheduling decisions. A theoretical analysis of E-LAS is conducted to offer a deeper understanding of the components of its scheduling criteria. Furthermore, we present a placement algorithm, complementary to the scheduling algorithm, that achieves better resource utilization without much implementation overhead. Extensive simulations demonstrate that E-LAS improves the average job completion time (JCT) by 10× over an Apache YARN-based resource manager used in production. Moreover, E-LAS outperforms Tiresias, the state-of-the-art scheduling algorithm customized for deep learning jobs, by almost 1.5× in both average JCT and queuing time.