Abstract: As the quantity and complexity of the information processed by software systems increase, large-scale software systems increasingly require high-performance distributed computing. With the acceleration of the Internet in the Web 2.0 era, Cloud computing, a paradigm that provides dynamic, uncertain and elastic services, has proven well suited to meeting computing needs dynamically. Without an appropriate scheduling approach, extensive use of Cloud computing may lead to high energy consumption and high cost, and high energy consumption in turn causes massive carbon dioxide emissions. Moreover, inappropriate scheduling shortens the service life of physical devices and increases the response time to users' requests. Hence, efficient scheduling of resources or optimal allocation of requests, which is usually an NP-hard problem, is one of the prominent issues in emerging trends of Cloud computing. Focusing on improving quality of service (QoS), reducing cost and abating pollution, researchers have conducted extensive work on resource scheduling problems in Cloud computing over the years. Nevertheless, the growing complexity of Cloud computing, a super-massive distributed system, limits the applicability of these scheduling approaches. Machine learning, a practical method for tackling problems in complex scenarios, has in recent years been applied to resource scheduling in Cloud computing as an innovative idea. Deep reinforcement learning (DRL), a combination of deep learning (DL) and reinforcement learning (RL), is one branch of machine learning and shows considerable promise for resource scheduling in Cloud computing. This paper surveys resource scheduling methods with a focus on DRL-based scheduling approaches in Cloud computing, reviews the applications of DRL, and discusses the challenges and future directions of DRL in Cloud computing scheduling.

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey

GPU Cluster Scheduling for Network-Sensitive Deep Learning

Energy-Efficient GPU Clusters Scheduling for Deep Learning

SCHED²: Scheduling Deep Learning Training Via Deep Reinforcement Learning

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands

DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters

On a Meta Learning-based Scheduler for Deep Learning Clusters

Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Energy-Aware Non-Preemptive Task Scheduling with Deadline Constraint in DVFS-Enabled Heterogeneous Clusters

Deep Reinforcement Learning-Based Workload Scheduling for Edge Computing

Energy-aware Task Scheduling with Deadline Constraint in DVFS-enabled Heterogeneous Clusters

A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud Via Reinforcement Learning

GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads

Deep Reinforcement Learning-based Methods for Resource Scheduling in Cloud Computing: A Review and Future Directions

A Survey of Multi-Tenant Deep Learning Inference on GPU

Learning Interpretable Scheduling Algorithms for Data Processing Clusters