Vapor: A GPU Sharing Scheduler with Communication and Computation Pipeline for Distributed Deep Learning

Xiaorui Zhu, Lei Gong, Zongwei Zhu, Xuehai Zhou
DOI: https://doi.org/10.1109/ispa-bdcloud-socialcom-sustaincom52081.2021.00028
2021-01-01
Abstract: Distributed deep learning (DDL) workloads are now typically trained on GPU clusters. Modern GPUs support the concurrent execution of multiple jobs, which allows computation and communication to be parallelized efficiently. While sharing a GPU improves resource utilization and reduces job completion time, interference among co-located jobs causes significant performance degradation. Recent approaches such as interference-aware job placement co-locate the jobs with the least interference, but only in a best-effort manner, which is not ideal. Another challenge in GPU sharing is stragglers caused by unbalanced workloads in packing. When stragglers occur, GPU resources are wasted and training is delayed. We present Vapor, a GPU sharing scheduler for distributed deep learning on multiple GPUs. It features two novel scheduling policies, preemptive GPU sharing and adaptive batch redistribution, to maximize GPU utilization and improve training efficiency. First, Vapor uses preemptive scheduling that overlaps the computation and communication of co-located jobs in a pipelined manner. Second, Vapor uses an adaptive batch redistribution method to handle stragglers in packing and further improve resource utilization. Vapor provides an AIMD model to predict the relationship between the batch size of the training data and the model computation time in each iteration. We evaluate Vapor against other representative schedulers. Experiments on a Kubernetes cluster of 16 Tesla V100 GPUs running popular DDL jobs show that Vapor reduces job completion time by 21.8% compared with popular state-of-the-art schedulers.
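To make the idea of batch redistribution concrete, the following is a minimal sketch and not Vapor's actual algorithm, which the abstract does not specify. It assumes that per-iteration computation time grows linearly with the local batch size, and it splits a fixed global batch across workers in proportion to their measured speed so that all workers finish an iteration at roughly the same time. The function name `redistribute_batch` and the `secs_per_sample` measurements are hypothetical.

```python
from typing import List


def redistribute_batch(global_batch: int, secs_per_sample: List[float]) -> List[int]:
    """Split a global batch across workers in proportion to their speed.

    Assumption (not from the paper): a worker's compute time per iteration
    is local_batch * secs_per_sample, i.e. linear in the local batch size.
    """
    if global_batch < 0:
        raise ValueError("global_batch must be non-negative")
    if not secs_per_sample or any(s <= 0 for s in secs_per_sample):
        raise ValueError("secs_per_sample must contain positive values")

    # Speed of each worker in samples per second.
    speeds = [1.0 / s for s in secs_per_sample]
    total_speed = sum(speeds)

    # Ideal (fractional) share for each worker, then round down.
    ideal = [global_batch * v / total_speed for v in speeds]
    shares = [int(x) for x in ideal]

    # Give the samples lost to rounding to the workers with the
    # largest fractional remainders, so the shares sum to global_batch.
    leftover = global_batch - sum(shares)
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # Two fast workers and two stragglers that are half as fast.
    print(redistribute_batch(256, [0.01, 0.01, 0.02, 0.02]))  # [85, 85, 43, 43]
```

In this sketch the two slower workers receive about half as many samples as the faster ones, which equalizes the estimated per-iteration time under the linear assumption. A real scheduler would also need to re-measure speeds over time, since interference from co-located jobs changes them.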