Abstract: Efficient GPU scheduling is key to minimizing the execution time of Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. Recently introduced schedulers that can dynamically reallocate GPUs achieve better cluster efficiency. This dynamic reallocation, however, either introduces additional overhead by terminating and restarting jobs or requires modifications to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that combines an adaptive GPU scheduler with an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a “SideCar” process that temporarily stops the job’s DL training process and restarts it with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
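The stop-and-restart idea behind the “SideCar” process can be illustrated with a minimal Python sketch. This is not the authors' implementation, which the abstract says is built as Kubernetes plugins. It assumes a hypothetical supervisor class `SideCar` and helper `gpu_env`, a training command that resumes from its own checkpoint when relaunched, and GPU visibility controlled through the `CUDA_VISIBLE_DEVICES` environment variable.

```python
import os
import subprocess


def gpu_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU ids.

    Hypothetical helper: relies on CUDA_VISIBLE_DEVICES to limit which
    GPUs the relaunched training process can see.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env


class SideCar:
    """Hypothetical supervisor that restarts a training command on a new GPU set.

    Assumption: the training command checkpoints its own progress and
    resumes from the latest checkpoint when it is started again.
    """

    def __init__(self, command):
        self.command = list(command)
        self.process = None
        self.gpu_ids = []

    def start(self, gpu_ids):
        """Launch the training command with only `gpu_ids` visible."""
        self.gpu_ids = list(gpu_ids)
        self.process = subprocess.Popen(self.command, env=gpu_env(self.gpu_ids))

    def stop(self, timeout=30):
        """Ask the training process to exit, killing it if it does not."""
        if self.process is None:
            return
        if self.process.poll() is None:
            self.process.terminate()
            try:
                self.process.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                self.process.kill()
                self.process.wait()
        self.process = None

    def resize(self, gpu_ids):
        """Temporarily stop the job, then restart it on a different GPU set."""
        self.stop()
        self.start(gpu_ids)
```

In this sketch a scheduler would call `resize([0, 1, 2, 3])` to grow a job from one GPU to four. Only the training process is restarted while the supervisor keeps running, which is the property the abstract contrasts with terminating and restarting whole jobs.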

An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems.
