Kale: Elastic GPU Scheduling for Online DL Model Training

Ziyang Liu, Renyu Yang, Jin Ouyang, Weihan Jiang, Tianyu Ye, Menghao Zhang, Sui Huang, Jiaming Huang, Chengru Song, Di Zhang, Tianyu Wo, Chunming Hu
DOI: https://doi.org/10.1145/3698038.3698532
2024-01-01
Abstract: Large-scale GPU clusters are widely used to effectively train both online and offline deep learning (DL) jobs. However, elastic scheduling in most resource schedulers is designed for offline model training, where resource adjustments are planned ahead of time. The native autoscaling policy relies on pre-defined thresholds and, if applied directly to online model training, often adjusts resources too late, which diminishes model accuracy. In this paper, we present Kale, a novel elastic GPU scheduling system that improves the performance of online DL model training. Through traffic forecasting and resource-throughput modeling, Kale automatically determines the number of GPUs needed to accommodate the on-the-fly data samples before performing stabilized autoscaling. An advanced data shuffling strategy is further employed to balance uneven samples among training workers, thereby improving runtime efficiency. Experiments show that Kale substantially outperforms state-of-the-art solutions. Compared with the default HPA autoscaling strategy, Kale reduces the accumulated lag and downtime by 69.2% and 33.1%, respectively, while lowering the SLO violation rate from 19.57% to just 2.6%. Kale has been deployed on Kuaishou's production-level GPU clusters and successfully underpins real-time video recommendation and advertisement at scale.
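The abstract describes the sizing step only at a high level: forecast the traffic, map it to a GPU count through a resource-throughput model, then autoscale in a stabilized way. The Python sketch below illustrates that general idea and is not Kale's actual algorithm. It assumes a linear throughput model (each GPU processes a fixed number of samples per second) and a simple windowed stabilization rule similar in spirit to the scale-down behavior of threshold-based autoscalers. The function names, the `headroom` parameter, and the GPU limits are all hypothetical.

```python
import math


def required_gpus(forecast_samples_per_sec, per_gpu_samples_per_sec,
                  headroom=0.1, min_gpus=1, max_gpus=64):
    """Return the smallest GPU count whose modeled throughput covers the forecast.

    Assumption: throughput scales linearly with the number of GPUs. A real
    resource-throughput model would likely be fitted from measurements.
    """
    if per_gpu_samples_per_sec <= 0:
        raise ValueError("per_gpu_samples_per_sec must be positive")
    # Add a safety margin on top of the forecast traffic.
    target = forecast_samples_per_sec * (1.0 + headroom)
    needed = math.ceil(target / per_gpu_samples_per_sec)
    # Clamp to the (hypothetical) cluster limits.
    return max(min_gpus, min(max_gpus, needed))


def stabilized_target(current_gpus, recent_recommendations):
    """Damp oscillation: scale up at once, scale down only conservatively.

    Assumption: recent_recommendations is a non-empty list ordered from
    oldest to newest, covering a fixed stabilization window.
    """
    latest = recent_recommendations[-1]
    if latest > current_gpus:
        # Traffic is rising: follow the newest recommendation immediately.
        return latest
    # Traffic is flat or falling: never drop below the window's peak.
    return min(current_gpus, max(recent_recommendations))
```

For example, with a forecast of 1000 samples per second, 250 samples per second per GPU, and 10% headroom, `required_gpus` returns 5. Scaling up immediately while scaling down only to the peak of the recent window is one common way to avoid thrashing; the paper's own stabilization policy may differ.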