Abstract: Coflow is a recently proposed networking abstraction that helps improve the communication performance of data-parallel computing jobs. A multi-stage job consists of multiple coflows and is represented by a Directed Acyclic Graph (DAG). Scheduling coflows efficiently is critical to improving data-parallel computing performance in data centers. In contrast to hand-tuned scheduling heuristics, the existing work DeepWeave [1] uses a Reinforcement Learning (RL) framework to generate highly efficient coflow scheduling policies automatically. It employs a graph neural network (GNN) to encode the job information in a set of embedding vectors, and feeds a flat embedding vector containing the whole job information to the policy network. However, this method scales poorly: it cannot cope with jobs represented by DAGs of arbitrary sizes and shapes, because processing the resulting high-dimensional embedding vector requires a large policy network that is difficult to train. In this paper, we first use a directed acyclic graph neural network (DAGNN) to process the input and propose a novel Pipelined-DAGNN, which effectively speeds up the feature extraction process of the DAGNN. Next, we feed the policy network an embedding sequence composed of only the schedulable coflows, instead of a flat embedding of all coflows, and output a priority sequence. As a result, the size of the policy network depends only on the feature dimension, rather than on the product of the feature dimension and the number of nodes in the job's DAG. Furthermore, to improve the accuracy of the priority scheduling policy, we incorporate the Self-Attention Mechanism into a deep RL model to capture the interaction between different parts of the embedding sequence, so that the output priority scores are mutually relevant. Based on this model, we then develop a coflow scheduling algorithm for online multi-stage jobs.
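
The claim that the policy network's size depends only on the feature dimension can be illustrated with a minimal sketch. It is not the paper's model: it assumes a single-head scaled dot-product self-attention layer followed by a linear scoring head, and every name in it (`init_params`, `priority_scores`, `num_params`) is hypothetical. It shows only that one fixed set of weights can score an embedding sequence of any length.

```python
import math
import random


def matvec(W, x):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(d, seed=0):
    """Random weights for the assumed layer: three d x d projections
    plus a length-d scoring vector. No shape depends on sequence length."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]

    return {"Wq": mat(), "Wk": mat(), "Wv": mat(),
            "w_out": [rng.uniform(-0.5, 0.5) for _ in range(d)]}


def priority_scores(params, embeddings):
    """Return one priority score per schedulable-coflow embedding.

    Each embedding attends to every embedding in the sequence, so each
    score reflects the other schedulable coflows as well as its own input.
    """
    d = len(params["w_out"])
    Q = [matvec(params["Wq"], e) for e in embeddings]
    K = [matvec(params["Wk"], e) for e in embeddings]
    V = [matvec(params["Wv"], e) for e in embeddings]
    scores = []
    for q in Q:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(logits)
        context = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)]
        scores.append(sum(w * c for w, c in zip(params["w_out"], context)))
    return scores


def num_params(params):
    """Count scalar weights: 3 * d * d + d, independent of sequence length."""
    d = len(params["w_out"])
    return 3 * d * d + d
```

With feature dimension `d = 4`, the same 52 weights score a sequence of 3 schedulable coflows or of 7. A policy that consumes a flat concatenation of all node embeddings would instead need an input layer whose width grows with the number of nodes in the DAG.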

GARLSched: Generative Adversarial Deep Reinforcement Learning Task Scheduling Optimization for Large-Scale High Performance Computing Systems

RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning

SCHED²: Scheduling Deep Learning Training Via Deep Reinforcement Learning

Enhancing Kubernetes Automated Scheduling with Deep Learning and Reinforcement Techniques for Large-Scale Cloud Computing Optimization

A2C-DRL: Dynamic Scheduling for Stochastic Edge-Cloud Environments Using A2C and Deep Reinforcement Learning

Reinforcement Learning for Adaptive Resource Scheduling in Complex System Environments

A Scalable Deep Reinforcement Learning Model for Online Scheduling Coflows of Multi-Stage Jobs for High Performance Computing

On a Meta Learning-based Scheduler for Deep Learning Clusters

A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud Via Reinforcement Learning

Learning Interpretable Scheduling Algorithms for Data Processing Clusters

HeterPS: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments

Learning to Schedule DAG Tasks

Data Centers Job Scheduling with Deep Reinforcement Learning

Deep Reinforcement Learning Assisted Genetic Programming Ensemble Hyper-Heuristics for Dynamic Scheduling of Container Port Trucks

A HPC Co-Scheduler with Reinforcement Learning

Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach

An Improving List Scheduling Algorithm Based on Reinforcement Learning and Task Duplication

Genetic Programming and Reinforcement Learning on Learning Heuristics for Dynamic Scheduling: A Preliminary Comparison

Multi-Agent Deep Reinforcement Learning-Based Resource Allocation in HPC/AI Converged Cluster

RLPTO: A reinforcement learning-based performance-time optimized task and resource scheduling mechanism for distributed machine learning