Dynamic Resource Allocation for Deep Learning Clusters with Separated Compute and Storage
Zhenhua Han, Ruiting Zhou, Yuanchi Liu, Haisheng Tan, Chi Zhang, Mingxia Li
DOI: https://doi.org/10.1109/INFOCOM53939.2023.10228920
2023-05-17
Abstract: The separation of compute and storage in modern cloud services eases the deployment of general applications. However, with the advance of accelerators such as GPUs and TPUs, Deep Learning (DL) training can suffer from IO bottlenecks when loading data from storage clusters. DL training jobs must therefore either create a local cache in the compute cluster to reduce bandwidth demand, or scale up IO capacity at a higher bandwidth cost. Choosing the best strategy is challenging because DL models differ in their cache/IO preferences, datasets are shared among multiple jobs, and the GPU allocation of DL training jobs scales dynamically. In this work, we characterize jobs by their training throughput, dataset size, and scalability. For jobs with fixed GPU allocations, we propose CBA, which minimizes training cost with a closed-form approach. For clusters that can automatically scale the GPU allocations of jobs, we extend CBA to AutoCBA, which supports diverse job utility functions and improves social welfare within a limited budget. Extensive experiments with production traces show that CBA and AutoCBA reduce IO cost by up to 20.5% and improve total social welfare by up to 2.27×, respectively, over state-of-the-art schedulers for DL training.
Engineering, Computer Science