Interference-Aware Opportunistic Job Placement for Shared Distributed Deep Learning Clusters

Hongliang Li, Hairui Zhao, Ting Sun, Xiang Li, Haixiao Xu, Keqin Li
DOI: https://doi.org/10.1016/j.jpdc.2023.104776
IF: 4.542
2023-01-01
Journal of Parallel and Distributed Computing
Abstract: Distributed deep learning frameworks facilitate large deep learning workloads. These frameworks support sharing one GPU device among multiple jobs to improve resource utilization. Modern deep learning training jobs consume a large amount of GPU memory. Even so, sharing GPU memory among jobs is still possible because a training job proceeds in iterative steps, so its memory usage fluctuates over time. However, resource sharing also introduces the risk of job performance degradation. Co-located jobs sharing a GPU device may suffer from different levels of interference, mainly caused by memory oversharing. How to improve resource utilization while maintaining good job performance is a novel challenge for job placement strategies. This paper studies the job placement problem. We propose an opportunistic memory sharing model to describe the time-varying job memory requirements. Based on this model, we introduce an Opportunistic Job Placement Problem (OJPP) for shared GPU clusters, which seeks job placement configurations that use a minimum number of GPU devices while guaranteeing user-defined performance requirements. We propose a greedy algorithm and a heuristic algorithm with computational complexities of O(n log n) and O(n^2 log n), respectively, to solve the problem. We also propose an online adjustment algorithm with a computational complexity of O(n log n) to update job placement configurations at runtime. A machine-learning-based interference prediction method is used to prepare accurate interference estimations. Extensive experiments are conducted on a GPU cluster to verify the correctness and effectiveness of our algorithms. Compared with standalone training jobs on dedicated clusters, the proposed approach reduces resource consumption by 46% in a shared cluster, while guaranteeing over 92.97% of the job performance, in terms of average job completion time.
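The abstract does not spell out the placement algorithms, so the following is only an illustrative sketch of the general idea of packing jobs by time-varying memory demand, not the paper's greedy or heuristic algorithm. It assumes each job is described by a hypothetical per-time-slot memory profile (in GB) over one aligned period, and it uses a generic first-fit-decreasing rule: a job may join a GPU only if the summed demand stays within capacity in every slot. The names `fits` and `greedy_place` are made up for this example, and interference prediction and performance guarantees are not modeled.

```python
from typing import List


def fits(gpu_load: List[float], profile: List[float], capacity: float) -> bool:
    """True if adding `profile` keeps every time slot within `capacity`."""
    return all(used + need <= capacity for used, need in zip(gpu_load, profile))


def greedy_place(profiles: List[List[float]], capacity: float) -> List[List[int]]:
    """First-fit-decreasing placement on per-slot memory profiles (assumed model).

    Returns one list of job indices per GPU that was opened.
    """
    # Consider jobs with the largest peak demand first.
    order = sorted(range(len(profiles)), key=lambda j: max(profiles[j]), reverse=True)
    gpu_loads: List[List[float]] = []   # summed per-slot demand on each GPU
    placement: List[List[int]] = []     # job indices assigned to each GPU
    for j in order:
        if max(profiles[j]) > capacity:
            raise ValueError(f"job {j} does not fit on an empty GPU")
        for g, load in enumerate(gpu_loads):
            if fits(load, profiles[j], capacity):
                gpu_loads[g] = [u + p for u, p in zip(load, profiles[j])]
                placement[g].append(j)
                break
        else:
            # No existing GPU can host the job in every slot: open a new one.
            gpu_loads.append(list(profiles[j]))
            placement.append([j])
    return placement


# Jobs 0 and 1 peak in alternating slots, so they can share one 12 GB GPU
# even though their peaks (10 + 10) would not fit together.
example = greedy_place([[10, 2, 10, 2], [2, 10, 2, 10], [9, 9, 9, 9]], capacity=12)
```

In this toy input, packing by peak memory alone would need three GPUs, while the per-slot check lets the two jobs with complementary profiles share a device, which is the kind of saving an opportunistic memory sharing model aims to capture.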