JetEsti: A New DLT Job Scheduling Simulator Based on Fine-Grained Process Modeling.

Yongzhe He, Yueyuan Zhou, En Shao, Guangming Tan, Ninghui Sun
DOI: https://doi.org/10.1109/icdcs57875.2023.00101
2023-01-01
Abstract: Large-scale deep learning training (DLT) jobs are time-consuming and are usually run in a distributed cluster environment. However, existing DLT frameworks such as TensorFlow do not include ad hoc optimizations for parallelism and scheduling, which results in seriously low efficiency. Researchers therefore need to choose appropriate scheduling algorithms for cluster jobs. Given the high cost of hardware resources, it is necessary to use a job scheduling simulator (JSS) to verify the performance of different scheduling algorithms in advance.