Abstract: Deep learning is widely used in many areas, but users must manually specify the resource configuration when submitting deep learning training jobs, and they usually over-provision resources. This ill-fitting manual configuration results in slow training and low resource utilization. It would be more convenient and efficient if users only needed to specify the quality of service (QoS) for their jobs and the resources were then configured automatically to meet that QoS. To satisfy this demand, we present a QoS-oriented scheduling and autoscaling framework that schedules and autoscales deep learning training jobs in a Kubernetes cluster. This paper focuses on the most important QoS requirement for deep learning training jobs: the deadline. The goal of the framework is to guarantee that as many jobs as possible are completed before their specified deadlines. To reach this goal, the framework schedules deep learning jobs with a heuristic scheduling policy based on resource status and job deadlines, and autoscales resource configurations by exploiting a characteristic of deep learning jobs: the predictability of training time. This predictability is used to predict whether a job can be completed before its deadline and, if necessary, to estimate an appropriate resource configuration. We implemented the framework by modifying the default Kubernetes scheduler and conducted experiments to evaluate its performance. The experimental results show that our scheduling policy improves the completion rate by 26% when cluster resources are insufficient, and that our autoscaling policy raises the completion rate to 100% when cluster resources are sufficient. We also show that the framework raises the utilization of allocated CPUs to 100%. Our proposed framework points to a new way of submitting and managing deep learning training jobs in a cluster.
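
The deadline check and resource estimation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the prediction model or the heuristic, so the sketch assumes a simple sub-linear speedup model (the hypothetical `efficiency` parameter) and substitutes earliest-deadline-first ordering for the scheduling heuristic. All names (`Job`, `predicted_finish`, `workers_needed`, `schedule_order`) are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    remaining_steps: int   # training steps still to run
    secs_per_step: float   # measured step time with a single worker
    deadline_secs: float   # time left until the user-specified deadline
    workers: int = 1       # workers currently allocated


def predicted_finish(job: Job, workers: int, efficiency: float = 0.9) -> float:
    """Predict remaining training time in seconds for a given worker count.

    Assumed model (not from the paper): each extra worker contributes
    `efficiency` of a full worker's throughput.
    """
    speedup = 1 + efficiency * (workers - 1)
    return job.remaining_steps * job.secs_per_step / speedup


def workers_needed(job: Job, max_workers: int,
                   efficiency: float = 0.9) -> Optional[int]:
    """Return the smallest worker count that meets the deadline, or None."""
    for w in range(job.workers, max_workers + 1):
        if predicted_finish(job, w, efficiency) <= job.deadline_secs:
            return w
    return None  # deadline cannot be met within the available resources


def schedule_order(jobs: List[Job]) -> List[Job]:
    """Order jobs earliest-deadline-first (a stand-in heuristic)."""
    return sorted(jobs, key=lambda j: j.deadline_secs)


if __name__ == "__main__":
    a = Job("a", remaining_steps=1000, secs_per_step=1.0, deadline_secs=600.0)
    b = Job("b", remaining_steps=1000, secs_per_step=1.0, deadline_secs=100.0)
    for job in schedule_order([a, b]):
        print(job.name, workers_needed(job, max_workers=4))
```

In this sketch a job whose predicted finish time already fits its deadline keeps its current allocation, one that would miss it is scaled up to the smallest sufficient worker count, and `None` marks a job that cannot meet its deadline even at `max_workers`.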

An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

SchedTune: A Heterogeneity-Aware GPU Scheduler for Deep Learning

GPU Cluster Scheduling for Network-Sensitive Deep Learning

Energy-Efficient GPU Clusters Scheduling for Deep Learning

UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands

A QoS-oriented Scheduling and Autoscaling Framework for Deep Learning

Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

GPU Shared Scheduling System Under Deep Learning Container Cloud Platform

Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

AutoSched: an Adaptive Self-configured Framework for Scheduling Deep Learning Training Workloads

SCHED²: Scheduling Deep Learning Training Via Deep Reinforcement Learning

Implementation of GPU Scheduling Method for Kubernetes

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters

GScheduler: Optimizing Resource Provision by Using GPU Usage Pattern Extraction in Cloud Environments

Dynamic Resource Allocation for Deep Learning Clusters with Separated Compute and Storage

Online Scheduling for Exploratory Training Jobs in Deep Learning Clusters