Abstract: Deep learning is popular in many areas, but users must manually specify the resource configuration when submitting deep learning training jobs, and they usually over-provision resources. Such poorly matched resource configurations result in slow training and low resource utilization. It would be more convenient and efficient if users only needed to specify the quality of service (QoS) for their jobs, with resources then configured automatically to meet that QoS. To satisfy this demand, we present a QoS-oriented scheduling and autoscaling framework that schedules and autoscales deep learning training jobs in a Kubernetes cluster. This paper focuses on the most important QoS requirement for deep learning training jobs: the deadline. The goal of the framework is to ensure that as many jobs as possible complete before their specified deadlines. To reach this goal, the framework schedules deep learning jobs with a heuristic policy based on resource status and job deadline, and autoscales resource configurations by exploiting a characteristic of deep learning jobs: the predictability of training time. This predictability is used to predict whether a job can complete before its deadline and, if necessary, to estimate an appropriate resource configuration. We implemented the framework by modifying the default Kubernetes scheduler and conducted experiments to evaluate its performance. The experimental results show that our scheduling policy improves the completion rate by 26% when cluster resources are insufficient, and that our autoscaling policy raises the completion rate to 100% when cluster resources are sufficient. We also show that the framework raises the utilization of allocated CPUs to 100%. Our proposed framework points to a new way of submitting and managing deep learning training jobs in a cluster.
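To make the idea of deadline-driven scheduling and autoscaling concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical throughput model (`steps_per_second`, with made-up `base_rate` and `efficiency` parameters) as a stand-in for the training-time predictor, and it uses a simple earliest-deadline-first ordering as one possible heuristic; the abstract does not specify either detail. All names (`Job`, `min_cpus_for_deadline`, `schedule`) are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    remaining_steps: int   # training steps still to run
    deadline_s: float      # seconds from now until the deadline


def steps_per_second(cpus: int, base_rate: float = 2.0,
                     efficiency: float = 0.9) -> float:
    """Hypothetical throughput model: sublinear scaling with CPU count."""
    return base_rate * cpus ** efficiency


def predicted_finish_s(job: Job, cpus: int) -> float:
    """Predicted remaining training time on the given number of CPUs."""
    return job.remaining_steps / steps_per_second(cpus)


def min_cpus_for_deadline(job: Job, max_cpus: int) -> Optional[int]:
    """Smallest CPU count predicted to meet the deadline, or None."""
    for n in range(1, max_cpus + 1):
        if predicted_finish_s(job, n) <= job.deadline_s:
            return n
    return None


def schedule(jobs: List[Job], free_cpus: int) -> Dict[str, int]:
    """Assumed heuristic: earliest deadline first, minimal feasible CPUs."""
    plan: Dict[str, int] = {}
    for job in sorted(jobs, key=lambda j: j.deadline_s):
        need = min_cpus_for_deadline(job, free_cpus)
        if need is None:
            continue  # cannot meet the deadline with what is left; skip
        plan[job.name] = need
        free_cpus -= need
    return plan


if __name__ == "__main__":
    jobs = [Job("a", 1000, 100.0), Job("b", 100, 50.0)]
    print(schedule(jobs, free_cpus=8))
```

Allocating the smallest CPU count that is predicted to meet each deadline is one way to avoid the over-provisioning the abstract describes, since leftover CPUs stay available for other jobs.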

ElasticDL: A Kubernetes-native Deep Learning Framework with Fault-tolerance and Elastic Scheduling

Effective Elastic Scaling of Deep Learning Workloads

ElasticFlow: An elastic serverless training platform for distributed deep learning

Elastic Deep Learning in Multi-Tenant GPU Clusters

XDL: an industrial deep learning framework for high-dimensional sparse data

A QoS-oriented Scheduling and Autoscaling Framework for Deep Learning

Elastic Scheduler: Heterogeneous and Dynamic Deep Learning in the Cloud

DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

FfDL: A Flexible Multi-tenant Deep Learning Platform

An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

An Improved Kubernetes Scheduling Algorithm for Deep Learning Platform

Elan: Towards Generic and Efficient Elastic Training for Deep Learning

BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster

Elastic Federated Learning with Kubernetes Vertical Pod Autoscaler for edge computing

BigDL: A Distributed Deep Learning Framework for Big Data

Large-scale Knowledge Distillation with Elastic Heterogeneous Computing Resources

VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling

Enhancing Kubernetes Automated Scheduling with Deep Learning and Reinforcement Techniques for Large-Scale Cloud Computing Optimization

Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge