Abstract: Scheduling job flows efficiently and rapidly on distributed computing clusters is one of the major challenges in the daily operation of data centers. In practice, a single job consists of numerous stages with complex dependency relations, represented as a Directed Acyclic Graph (DAG). Today a data center is usually equipped with a cluster of heterogeneous computing servers that differ in hardware and software configuration. In terms of both cost savings and environmental friendliness, data centers could benefit greatly from optimizing job scheduling in such heterogeneous environments. The problem has therefore attracted growing attention from both industry and academia. In this paper, we propose a task-duplication based learning algorithm, namely \lachesis\footnote{The second of the Three Fates in ancient Greek mythology, who determines destiny.}, to address this problem. The proposed approach first perceives the topological dependencies between jobs using a reinforcement learning framework and a specially designed graph neural network (GNN) to select the most promising task to be executed. The task is then assigned to a specific executor, with consideration of duplicating all its predecessor tasks according to expert-designed rules. We have conducted extensive experiments over standard workloads to evaluate the proposed solution. The experimental results suggest that \lachesis{} can achieve up to a 26.7\% reduction in makespan and a 35.2\% improvement in speedup ratio over seven strong baseline algorithms, including state-of-the-art heuristic methods and a variety of deep reinforcement learning based algorithms.
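The two-phase structure described in the abstract (select a task, then place it on an executor while considering duplication of its predecessors) can be illustrated with a minimal Python sketch. This is not the authors' method: the learned GNN/RL selector is replaced here by the HEFT-style upward rank, and the duplication rule ("re-run a parent locally if that finishes before its data would arrive") is a hypothetical stand-in for the expert-designed rules, which the abstract does not specify. All function names, the append-only executor timeline, and the toy DAG are assumptions made for illustration.

```python
def upward_rank(cost, comm):
    """Mean-cost upward rank (as in HEFT); a stand-in for the learned selector."""
    children = {t: [c for (p, c) in comm if p == t] for t in cost}
    memo = {}

    def rank(t):
        if t not in memo:
            mean = sum(cost[t]) / len(cost[t])
            memo[t] = mean + max(
                (comm[(t, c)] + rank(c) for c in children[t]), default=0.0
            )
        return memo[t]

    return {t: rank(t) for t in cost}


def arrival(copies, comm, src, dst, e):
    """Earliest time the output of src is available on executor e for dst."""
    return min(
        f + (0 if ex == e else comm[(src, dst)]) for ex, f in copies[src].items()
    )


def place(task, e, cost, comm, parents, copies, free_at, duplicate):
    """Tentatively place task on executor e; return (finish time, duplicates)."""
    t_free, dups, ready = free_at[e], {}, 0.0
    for p in parents[task]:
        arr = arrival(copies, comm, p, task, e)
        if duplicate:
            # Hypothetical rule: re-run parent p on e if that beats the transfer.
            p_ready = max(
                (arrival(copies, comm, g, p, e) for g in parents[p]), default=0.0
            )
            dup_finish = max(t_free, p_ready) + cost[p][e]
            if dup_finish < arr:
                t_free = arr = dup_finish
                dups[p] = dup_finish
        ready = max(ready, arr)
    return max(t_free, ready) + cost[task][e], dups


def schedule(cost, comm, duplicate=True):
    """cost[t][e]: run time of t on executor e; comm[(p, c)]: cross-executor
    transfer time of edge p -> c. Returns (makespan, copies per task)."""
    n_exec = len(next(iter(cost.values())))
    parents = {t: [p for (p, c) in comm if c == t] for t in cost}
    rank = upward_rank(cost, comm)
    free_at = [0.0] * n_exec          # time at which each executor becomes idle
    copies = {t: {} for t in cost}    # task -> {executor: finish time}
    finish = {}
    while len(finish) < len(cost):
        ready = [
            t for t in cost
            if t not in finish and all(p in finish for p in parents[t])
        ]
        task = max(ready, key=rank.get)                 # phase 1: select a task
        options = [
            place(task, e, cost, comm, parents, copies, free_at, duplicate) + (e,)
            for e in range(n_exec)
        ]
        fin, dups, e = min(options, key=lambda o: o[0])  # phase 2: pick executor
        for p, f in dups.items():
            copies[p][e] = f
        copies[task][e] = finish[task] = free_at[e] = fin
    return max(finish.values()), copies


# Toy fork DAG: A feeds B and C; transfers are expensive, A is cheap to re-run.
cost = {"A": [2, 2], "B": [5, 5], "C": [5, 5]}
comm = {("A", "B"): 10, ("A", "C"): 10}
with_dup, _ = schedule(cost, comm, duplicate=True)
without_dup, _ = schedule(cost, comm, duplicate=False)
print(with_dup, without_dup)  # 7.0 12.0
```

In this toy case, duplicating A on the second executor lets B and C run in parallel without waiting for a transfer, which shortens the makespan from 12 to 7. The numbers only illustrate the mechanism and say nothing about the results reported in the paper.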

Learning to Schedule DAG Tasks

Learning to Optimize DAG Scheduling in Heterogeneous Environment

An Improving List Scheduling Algorithm Based on Reinforcement Learning and Task Duplication

A Scheduling Algorithm Based on Reinforcement Learning for Heterogeneous Environments

Edge Generation Scheduling for DAG Tasks Using Deep Reinforcement Learning

A Reinforcement Learning Based Job Scheduling Algorithm for Heterogeneous Computing Environment

Octopus: an End-to-end Multi-DAG Scheduling Method Based on Deep Reinforcement Learning

Learning Interpretable Scheduling Algorithms for Data Processing Clusters

An Adaptive Priority-Based Heuristic Approach for Scheduling DAG Applications with Uncertainties

Directed Acyclic Task Graph Scheduling for Heterogeneous Computing Systems by Dynamic Critical Path Duplication Algorithm

DGCQN: a RL and GCN combined method for DAG scheduling in edge computing

Optimizing Schedule Length for DAG Type Applications with Energy Consumption Constraint in Heterogeneous Computing Systems

Dag-Based Parallel Real Time Task Scheduling Algorithm On A Cluster

Telemetry-aided cooperative multi-agent online reinforcement learning for DAG task scheduling in computing power networks

GARLSched: Generative Adversarial Deep Reinforcement Learning Task Scheduling Optimization for Large-Scale High Performance Computing Systems

A Path Relinking Enhanced Estimation of Distribution Algorithm for Direct Acyclic Graph Task Scheduling Problem

A Novel Task-Duplication based DAG Scheduling Algorithm for Heterogeneous Environments

A Grid DAG Scheduling Algorithm Based on Fuzzy Clustering

DAG Scheduling for Heterogeneous Systems Using Biogeography-Based Optimization

A hybrid genetic algorithm for tasks scheduling in heterogeneous computing systems

An Enhanced Priority-Based Scheduling Heuristic for DAG Applications with Temporal Unpredictability in Task Execution and Data Transmission