Abstract: We consider a market-based resource allocation model for batch jobs in cloud computing clusters. In our model, we incorporate the importance of the due date of a job rather than the number of servers allocated to it at any given time. Each batch job is characterized by its work volume in total computing units (e.g., CPU hours) along with a bound on its maximum degree of parallelism. Users specify, along with these job characteristics, their desired due date and a value for finishing the job by its deadline. Given this specification, the primary goal is to determine the scheduling of cloud computing instances under capacity constraints in order to maximize the social welfare (i.e., the sum of values gained by allocated users). Our main result is a new (C/(C-k) · s/(s-1))-approximation algorithm for this objective, where C denotes the cloud capacity, k is the maximal bound on parallelized execution (in practical settings, k < C), and s is the slackness on the job completion time, that is, the minimal ratio between a specified deadline and the earliest finish time of a job. Our algorithm is based on dual fitting arguments over a strengthened linear program for the problem. Based on the new approximation algorithm, we construct truthful allocation and pricing mechanisms, in which reporting the true value and other properties of the job (deadline, work volume, and the parallelism bound) is a dominant strategy for all users. To that end, we extend known results for single-value settings to provide a general framework for transforming allocation algorithms into truthful mechanisms in domains of single-value and multi-properties. We then show that the basic mechanism can be extended under proper Bayesian assumptions to the objective of maximizing revenues, which is important for public clouds.
We empirically evaluate the benefits of our approach through simulations on data-center job traces, and show that the revenues obtained under our mechanism are comparable with an ideal fixed-price mechanism, which sets an on-demand price using oracle knowledge of users’ valuations. Finally, we discuss how our model can be extended to accommodate uncertainties in job work volumes, which is a practical challenge in cloud settings.
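To make the approximation factor concrete, the following is a minimal sketch, not code from the paper. It assumes the factor reads as C/(C-k) · s/(s-1), and that a job's earliest finish time is its work volume divided by its parallelism bound, with every job available from time 0. The function names and the (work, k, deadline) tuple layout are hypothetical.

```python
from fractions import Fraction


def slackness(jobs):
    """Minimal ratio between a job's deadline and its earliest finish time.

    Each job is a hypothetical (work, k, deadline) tuple. Assumption: the
    earliest finish time is work / k, i.e. the job runs at its full
    parallelism bound starting at time 0.
    """
    return min(
        Fraction(deadline) / (Fraction(work) / k)
        for work, k, deadline in jobs
    )


def approximation_factor(C, k, s):
    """Evaluate C/(C-k) * s/(s-1) exactly, for capacity C, bound k, slackness s."""
    if not 0 < k < C:
        raise ValueError("expected 0 < k < C")
    if s <= 1:
        raise ValueError("expected slackness s > 1")
    s = Fraction(s)
    return Fraction(C, C - k) * s / (s - 1)


if __name__ == "__main__":
    # Two illustrative jobs: (work in CPU hours, parallelism bound, deadline in hours).
    jobs = [(100, 10, 30), (40, 4, 20)]
    s = slackness(jobs)                      # ratios are 3 and 2
    k = max(job_k for _, job_k, _ in jobs)   # maximal parallelism bound
    print(float(approximation_factor(1000, k, s)))
```

Under these assumptions the factor approaches 1 as k becomes small relative to C and as the slackness s grows, which matches the abstract's remark that k < C in practical settings.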

Online Scheduling of Distributed Machine Learning Jobs for Incentivizing Sharing in Multi-Tenant Systems

A Novel Job Scheduling Model to Enhance Efficiency and Overall User Fairness of Cloud Computing Environment

Online Scheduling Algorithm for Heterogeneous Distributed Machine Learning Jobs

Adaptive Pricing and Online Scheduling for Distributed Machine Learning Jobs

Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters

Online Job Scheduling in Distributed Machine Learning Clusters

Dynamic Pricing and Placing for Distributed Machine Learning Jobs: An Online Learning Approach

Online Placement and Scaling of Geo-Distributed Machine Learning Jobs Via Volume-Discounting Brokerage

Online Scheduling of Machine Learning Jobs in Edge-Cloud Networks

A SLA-based Scheduling Approach for Multi-Tenant Cloud Simulation

Incentive-Aware Resource Allocation for Multiple Model Owners in Federated Learning

On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention

SLA-Aware Tenant Placement and Dynamic Resource Provision in SaaS

Knowledge-Based Resource Allocation for Collaborative Simulation Development in a Multi-Tenant Cloud Computing Environment

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Near-Optimal Scheduling Mechanisms for Deadline-Sensitive Jobs in Large Computing Clusters

An Auction-Based Approach for Multi-Agent Uniform Parallel Machine Scheduling with Dynamic Jobs Arrival

Preemptive Scheduling for Distributed Machine Learning Jobs in Edge-Cloud Networks

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning

Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

DPS: Dynamic Pricing and Scheduling for Distributed Machine Learning Jobs in Edge-Cloud Networks