Abstract: Large-scale distributed machine learning (ML) systems consume extensive and costly computational resources. Pricing and scheduling, two promising techniques for resource management, have garnered significant attention. However, existing job pricing and scheduling algorithms in cloud computing either charge fixed resource fees based on a known job runtime or set prices dynamically with job preemption, which makes them unsuitable for distributed ML systems with high uncertainty and high switching costs. First, the runtime of a distributed ML job depends on whether its resources are placed together. Second, various time-varying factors, including job arrival rates and competitors' pricing, affect resource prices. Third, frequent price changes for the same resource can easily lead to system instability, ultimately jeopardizing user satisfaction. Addressing these uncertainties is challenging. This paper introduces APOS, an adaptive pricing and online scheduling framework that aims to maximize the operator's overall revenue. APOS incorporates two innovations. 1) Intelligent Pricing: We represent each price by a feature vector that encapsulates the relevant factors. Then, using linear Upper Confidence Bound (UCB) techniques, we learn the relationships between price features and two revenue-associated elements: job arrival rates and resource consumption rates. To ensure system stability, we introduce batch pricing to reduce the frequency of resource price updates. 2) Online Scheduling: We compute a non-preemptive schedule that balances job utility against the corresponding resource cost. We rigorously prove that APOS runs in polynomial time and achieves truthfulness, individual rationality, system stability, and sublinear regret. Finally, extensive trace-driven simulations confirm that APOS outperforms four state-of-the-art baselines, improving total operator revenue by at least 23.3%.
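
To make the pricing idea in the abstract more concrete, the following is a minimal sketch of a generic linear UCB price selector with batched updates. It is not the APOS algorithm itself: the class name `BatchedLinUCB`, the parameters `alpha` and `batch_size`, the single scalar `reward`, and the ridge-style initialization are all assumptions made for illustration, and the abstract does not specify how APOS combines arrival rates and resource consumption rates into its estimates.

```python
import math


def mat_vec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


class BatchedLinUCB:
    """Hypothetical linear UCB selector that changes price once per batch."""

    def __init__(self, dim, alpha=1.0, batch_size=5):
        self.alpha = alpha            # exploration weight (assumed)
        self.batch_size = batch_size  # rounds between price updates (assumed)
        # Inverse of the regularized design matrix A = I + sum(x x^T).
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim          # running sum of reward * x
        self.pending = []             # observations held until batch end
        self.current = None           # price posted for the current batch

    def _score(self, x):
        # Point estimate theta^T x plus a confidence bonus.
        theta = mat_vec(self.A_inv, self.b)
        width = math.sqrt(max(dot(x, mat_vec(self.A_inv, x)), 0.0))
        return dot(theta, x) + self.alpha * width

    def select(self, candidates):
        """candidates maps each price to its feature vector."""
        if self.current is None:
            self.current = max(candidates,
                               key=lambda p: self._score(candidates[p]))
        return self.current

    def observe(self, x, reward):
        """Record one observation; refit only when the batch is full."""
        self.pending.append((x, reward))
        if len(self.pending) >= self.batch_size:
            for xi, ri in self.pending:
                self._update(xi, ri)
            self.pending.clear()
            self.current = None  # allow a new price for the next batch

    def _update(self, x, reward):
        # Sherman-Morrison rank-one update of A_inv for A <- A + x x^T.
        Ax = mat_vec(self.A_inv, x)
        denom = 1.0 + dot(x, Ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
            self.b[i] += reward * x[i]
```

In this sketch the posted price stays fixed until `batch_size` observations have arrived, which mirrors the stated goal of reducing how often resource prices change; the model is refit only at batch boundaries.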

Online Scheduling Algorithm for Heterogeneous Distributed Machine Learning Jobs

Towards Efficient Scheduling of Federated Mobile Devices under Computational and Statistical Heterogeneity

Online Job Scheduling in Distributed Machine Learning Clusters

Online Scheduling Of Equal-Length Jobs On Parallel Machines

On-Line Scheduling of Parallel Jobs in Heterogeneous Multiple Clusters

Online Scheduling of Machine Learning Jobs in Edge-Cloud Networks

A Novel Job Scheduling Model to Enhance Efficiency and Overall User Fairness of Cloud Computing Environment

Online Scheduling of Mixed CPU-GPU Jobs

On-line Scheduling of Parallel Jobs on Two Machines

Online Scheduling on a CPU-GPU Cluster

Online Scheduling of Distributed Machine Learning Jobs for Incentivizing Sharing in Multi-Tenant Systems

Preemptive Scheduling for Distributed Machine Learning Jobs in Edge-Cloud Networks

Adaptive Pricing and Online Scheduling for Distributed Machine Learning Jobs

Online Placement and Scaling of Geo-Distributed Machine Learning Jobs Via Volume-Discounting Brokerage

Online Flexible Busy Time Scheduling on Heterogeneous Machines

Online job scheduling for distributed machine learning in optical circuit switch networks

Reinforcement Learning Based Online Scheduling of Multiple Workflows in Edge Environment

Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Online Approximation Scheme for Scheduling Heterogeneous Utility Jobs in Edge Computing

A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud Via Reinforcement Learning