Efficient Online Scheduling for Coflow-Aware Machine Learning Clusters

Wenxin Li, Sheng Chen, Keqiu Li, Heng Qi, Renhai Xu, Song Zhang
DOI: https://doi.org/10.1109/TCC.2020.3040312
IF: 5.697
2022-01-01
IEEE Transactions on Cloud Computing
Abstract: Distributed machine learning (DML) is an increasingly important workload. In a DML job, each communication phase can comprise a coflow, and there are dependencies among its coflows. Thus, efficient coflow scheduling becomes critical for DML jobs. However, the majority of existing solutions focus on scheduling single-stage coflows with no dependencies. While a few studies schedule the dependent coflows of multi-stage jobs, they suffer from either practical or theoretical issues. Motivated by this situation, we study how to schedule the dependent coflows of multiple DML jobs to minimize the total job completion time (JCT) in a shared cluster. We present a formal mathematical formulation for this problem and prove its NP-hardness. To solve this problem without job size information, we present an online coflow-aware optimization framework called Parrot. The core idea in Parrot is to infer, at each scheduling decision, the job with the shortest remaining processing time (SRPT) and to dynamically control the inferred job's bandwidth based on the confidence that it is indeed the SRPT job, while taking care not to starve any other job. Specifically, in the design of Parrot, we present a least per-coflow attained service (LPCAS) policy to infer the SRPT job. We further propose a dynamic job weight assignment mechanism and a linear program (LP) based weighted bandwidth scaling strategy for sharing bandwidth among DML jobs. We prove that the Parrot algorithm has a non-trivial competitive ratio. The results from large-scale trace-driven simulations further demonstrate that Parrot can reduce the total JCT by up to 58.4 percent compared to the state-of-the-art Aalo solution.
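
The idea of inferring the SRPT job and then favoring it without starving the others can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not define LPCAS precisely, so the score used here (bytes sent so far divided by the number of coflows seen) is an assumption, the fixed `boost` weight stands in for the dynamic, confidence-based weight assignment, and a proportional split of one link's capacity stands in for the LP-based bandwidth scaling. The names `Job`, `infer_srpt_job`, and `weighted_shares` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    attained_bytes: float  # bytes sent so far across all of the job's coflows
    coflows_seen: int      # coflows the job has started so far


def per_coflow_attained_service(job: Job) -> float:
    # ASSUMPTION: attained service averaged over the coflows seen so far.
    # The paper's exact LPCAS definition may differ.
    return job.attained_bytes / max(job.coflows_seen, 1)


def infer_srpt_job(jobs: list[Job]) -> Job:
    # Treat the job with the least per-coflow attained service as the SRPT job.
    return min(jobs, key=per_coflow_attained_service)


def weighted_shares(jobs: list[Job], capacity: float, boost: float = 3.0) -> dict[str, float]:
    # ASSUMPTION: a fixed boost replaces the confidence-based weight, and a
    # proportional split on a single link replaces the LP-based scaling.
    favored = infer_srpt_job(jobs)
    weights = {j.name: (boost if j is favored else 1.0) for j in jobs}
    total = sum(weights.values())
    # Every job keeps a non-zero weight, so no job is starved.
    return {name: capacity * w / total for name, w in weights.items()}


jobs = [Job("A", 900.0, 3), Job("B", 400.0, 1), Job("C", 1000.0, 5)]
shares = weighted_shares(jobs, capacity=10.0)
print(infer_srpt_job(jobs).name, shares)
```

In this toy input, job C has the lowest per-coflow score (200, against 300 for A and 400 for B), so it receives the boosted share while A and B still keep a non-zero share of the link.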