Elastic Deep Learning in Multi-Tenant GPU Clusters
Yidi Wu, Kaihao Ma, Xiao Yan, Zhi Liu, Zhenkun Cai, Yuzhen Huang, James Cheng, Han Yuan, Fan Yu
DOI: https://doi.org/10.1109/tpds.2021.3064966
IF: 5.3
2022-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: We study how to support elasticity, that is, the ability to dynamically adjust the parallelism (i.e., the number of GPUs), for deep neural network (DNN) training in a GPU cluster. Elasticity can benefit multi-tenant GPU cluster management in many ways, for example, achieving various scheduling objectives (e.g., job throughput, job completion time, GPU efficiency) according to cluster load variations, utilizing transient idle resources, and supporting performance profiling, job migration, and straggler mitigation. We propose EDL, which enables elastic deep learning with a simple API and can be easily integrated with existing deep learning frameworks such as TensorFlow and PyTorch. EDL also incorporates techniques that are necessary to reduce the overhead of parallelism adjustments, such as stop-free scaling and a dynamic data pipeline. We demonstrate with experiments that EDL can indeed bring significant benefits to the above-listed applications in GPU cluster management.
computer science, theory & methods; engineering, electrical & electronic
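
The abstract names a dynamic data pipeline as one way to keep parallelism adjustments cheap. The sketch below is not EDL's API or implementation. It only illustrates the general idea under one assumption: workers pull data partitions on demand from a shared queue, so adding or removing a worker never requires re-splitting the dataset. The class `PartitionQueue`, its methods, and the worker names are hypothetical.

```python
from collections import deque


class PartitionQueue:
    """Hypothetical coordinator that hands out data partitions on demand."""

    def __init__(self, num_partitions):
        self.pending = deque(range(num_partitions))  # partitions not yet read
        self.in_flight = {}  # worker id -> partition it is currently reading

    def next_partition(self, worker):
        """Assign the next unread partition, or None when the epoch is done."""
        if not self.pending:
            return None
        part = self.pending.popleft()
        self.in_flight[worker] = part
        return part

    def finish(self, worker):
        """Mark the worker's current partition as fully consumed."""
        self.in_flight.pop(worker, None)

    def remove_worker(self, worker):
        """On scale-in, put the leaving worker's unfinished partition back."""
        part = self.in_flight.pop(worker, None)
        if part is not None:
            self.pending.appendleft(part)


def run_epoch_demo():
    """Simulate one epoch in which a worker joins and another leaves."""
    queue = PartitionQueue(num_partitions=6)
    consumed = []

    # Start with two workers, each taking one partition.
    first = queue.next_partition("w0")
    queue.next_partition("w1")
    consumed.append(first)
    queue.finish("w0")

    # Scale out: a new worker simply asks the queue for work.
    third = queue.next_partition("w2")

    # Scale in: w1 leaves before finishing, so its partition is re-queued.
    queue.remove_worker("w1")
    consumed.append(third)
    queue.finish("w2")

    # The remaining workers drain the queue in turn.
    workers = ["w0", "w2"]
    turn = 0
    while True:
        worker = workers[turn % len(workers)]
        part = queue.next_partition(worker)
        if part is None:
            break
        consumed.append(part)
        queue.finish(worker)
        turn += 1
    return sorted(consumed)


if __name__ == "__main__":
    print(run_epoch_demo())
```

In this simplified version a leaving worker's partition is re-queued whole, so samples it had already read would be read again. A real system would likely also track a read offset within each partition.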