Elastic Scheduler: Heterogeneous and Dynamic Deep Learning in the Cloud.

Lujia Yin, Yiming Zhang, Yuxing Peng, Dongsheng Li
DOI: https://doi.org/10.1002/cpe.6206
2021-01-01
Concurrency and Computation: Practice and Experience
Abstract: GPUs and CPUs are widely used for training deep learning (DL) models in the cloud, where both DL workloads and resource usage may change substantially over time. Traditional training methods require the type (either GPUs or CPUs) and number of computing devices to be specified in advance, and thus cannot elastically schedule dynamic DL workloads onto the available GPUs/CPUs. In this paper, we propose Elastic Scheduler (ES), a novel approach that efficiently supports both heterogeneous training (with different device types) and dynamic training (with varying device numbers). ES (i) accumulates local gradients and simulates multiple virtual workers on one GPU to alleviate the performance gap between GPUs and CPUs, achieving accuracy in heterogeneous GPU-CPU-hybrid training similar to that of homogeneous training, and (ii) uses local gradients to stabilize batch sizes for high accuracy without long compensation. Experiments show that ES achieves significantly higher performance than existing methods for heterogeneous and dynamic training as well as inference.
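The idea of accumulating local gradients so that one fast device stands in for several virtual workers can be illustrated with a small sketch. This is not the authors' implementation: the one-parameter linear model, the squared-error loss, and the names `grad_mse` and `virtual_worker_grad` are assumptions made purely for illustration. It shows only that averaging the gradients of equally sized micro-batches processed one after another on a single device gives the same update as one gradient over the combined batch.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair; hypothetical toy data format


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Mean gradient of 0.5 * (w*x - y)^2 with respect to w over one batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def virtual_worker_grad(w: float, micro_batches: List[List[Sample]]) -> float:
    """One device plays len(micro_batches) virtual workers.

    Gradients are accumulated locally, one micro-batch at a time, and
    averaged once at the end, as if each micro-batch had been handled
    by a separate worker (assumes equally sized micro-batches).
    """
    accumulated = 0.0
    for micro_batch in micro_batches:
        accumulated += grad_mse(w, micro_batch)
    return accumulated / len(micro_batches)


# Toy data and weight, chosen only for illustration.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5

# Two virtual workers on one device, each with a micro-batch of two samples.
g_accumulated = virtual_worker_grad(w, [data[0:2], data[2:4]])

# Reference: a single gradient over the full batch of four samples.
g_full = grad_mse(w, data)

print(g_accumulated, g_full)
```

Because the micro-batches are the same size, the two values agree, which is why a device can emulate a variable number of workers while the effective batch size stays fixed.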