Abstract: Distributed machine learning (DML) is an increasingly important workload. In a DML job, each communication phase can comprise a coflow, and there are dependencies among the job's coflows. Thus, efficient coflow scheduling is critical for DML jobs. However, the majority of existing solutions focus on scheduling single-stage coflows with no dependencies. While a few studies schedule dependent coflows of multi-stage jobs, they suffer from either practical or theoretical issues. Motivated by this situation, we study how to schedule dependent coflows of multiple DML jobs to minimize the total job completion time (JCT) in a shared cluster. We present a formal mathematical formulation of this problem and prove its NP-hardness. To solve this problem without job size information, we present an online coflow-aware optimization framework called Parrot. The core idea in Parrot is to infer the job with the shortest remaining processing time (SRPT) each time and dynamically control the inferred job's bandwidth based on how confident Parrot is that it is an SRPT job, while being mindful of not starving any other job. Specifically, in the design of Parrot, we present a least per-coflow attained service (LPCAS) policy to infer the SRPT job. We further propose a dynamic job weight assignment mechanism and a linear program (LP) based weighted bandwidth scaling strategy for sharing bandwidth among DML jobs. We prove that the Parrot algorithm has a non-trivial competitive ratio. Results from large-scale trace-driven simulations further demonstrate that Parrot can reduce the total JCT by up to 58.4 percent compared to the state-of-the-art Aalo solution.

Efficient Online Scheduling for Coflow-Aware Machine Learning Clusters
