Job Placement Strategy with Opportunistic Resource Sharing for Distributed Deep Learning Clusters

Hongliang Li, Ting Sun, Xiang Li, Haixiao Xu
DOI: https://doi.org/10.1109/HPCC-SmartCity-DSS50907.2020.00079
2020-01-01
Abstract: Distributed deep learning frameworks train large deep learning workloads with multiple training jobs on shared distributed GPU servers. There are new challenges when scheduling resources for these systems. Modern deep learning training jobs tend to consume large amounts of GPU memory. A training job is iterative in nature, which causes its memory usage to fluctuate over time. Jobs sharing a host may suffe...