Adaptive Pricing and Online Scheduling for Distributed Machine Learning Jobs
Yafei Wang, Lina Su, Junmei Chen, Ne Wang, Zongpeng Li
DOI: https://doi.org/10.1109/jiot.2023.3336757
IF: 10.6
2024-01-01
IEEE Internet of Things Journal
Abstract: Large-scale distributed machine learning (ML) systems involve extensive and costly computational resources. Pricing and scheduling, as two promising techniques for resource management, have garnered significant attention. However, existing job pricing and scheduling algorithms in cloud computing either charge fixed resource fees based on known job runtimes or set prices dynamically with job preemption; both approaches are unsuitable for distributed ML systems, which have high uncertainty and high switching costs. First, whether the resources of a distributed ML job are placed together or not results in different job runtimes. Second, various time-varying factors, including job arrival rates and competitors’ pricing, affect resource prices. Third, frequent price changes for the same resource can easily lead to system instability, ultimately jeopardizing user satisfaction. Addressing these uncertainties is challenging. This paper introduces APOS, an adaptive pricing and online scheduling framework that aims to maximize the operator’s overall revenue. APOS incorporates two innovations. 1) Intelligent Pricing: We represent each price using a feature vector that encapsulates relevant factors. Then, using linear Upper Confidence Bound (UCB) techniques, we establish relationships between price features and two revenue-associated elements: job arrival rates and resource consumption rates. To ensure system stability, we introduce batch pricing to reduce the frequency of resource price updates. 2) Online Scheduling: We compute a non-preemptive schedule that balances job utility against the corresponding resource cost. We rigorously prove that APOS achieves truthfulness, individual rationality, system stability, and sublinear regret, and that it runs in polynomial time. Finally, extensive trace-driven simulations confirm that APOS outperforms four state-of-the-art baselines, improving total operator revenue by at least 23.3%.
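The pricing idea in the abstract (linear UCB over price feature vectors, with prices held fixed for a batch of rounds) can be illustrated with a generic sketch. This is not the APOS algorithm: the class name `BatchedLinUCB`, the two-dimensional features, the single scalar reward, and the parameters `alpha`, `batch_size`, and `ridge` are all assumptions made for illustration. The paper itself models job arrival rates and resource consumption rates separately, which this sketch does not attempt.

```python
import math


class BatchedLinUCB:
    """Generic linear UCB with batched price changes (illustrative, not APOS)."""

    def __init__(self, dim, alpha=1.0, batch_size=5, ridge=1.0):
        self.dim = dim
        self.alpha = alpha            # exploration weight (assumed value)
        self.batch_size = batch_size  # rounds during which the price stays fixed
        # Inverse of the regularized design matrix: (ridge * I)^-1 at the start.
        self.A_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim
        self.theta = [0.0] * dim      # current linear reward estimate
        self.buffer = []              # observations awaiting the batch boundary
        self.current = None
        self.rounds_left = 0

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def ucb(self, x):
        """Estimated reward plus a confidence-width bonus for feature vector x."""
        Ax = self._matvec(self.A_inv, x)
        mean = sum(t * xi for t, xi in zip(self.theta, x))
        width = math.sqrt(max(sum(xi * ai for xi, ai in zip(x, Ax)), 0.0))
        return mean + self.alpha * width

    def select(self, candidates):
        """Pick a new candidate only at a batch boundary; otherwise keep it."""
        if self.rounds_left == 0:
            self.current = max(range(len(candidates)),
                               key=lambda i: self.ucb(candidates[i]))
            self.rounds_left = self.batch_size
        self.rounds_left -= 1
        return self.current

    def observe(self, x, reward):
        """Buffer feedback; fold it into the model when the batch ends."""
        self.buffer.append((x, reward))
        if self.rounds_left == 0:
            self._flush()

    def _flush(self):
        for x, r in self.buffer:
            # Sherman-Morrison rank-one update of the inverse design matrix.
            Ax = self._matvec(self.A_inv, x)
            denom = 1.0 + sum(xi * ai for xi, ai in zip(x, Ax))
            for i in range(self.dim):
                for j in range(self.dim):
                    self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
            for i in range(self.dim):
                self.b[i] += r * x[i]
        self.buffer.clear()
        self.theta = self._matvec(self.A_inv, self.b)


def simulate(rounds=50, batch_size=5):
    """Toy run with a hypothetical noiseless linear reward model."""
    candidates = [[1.0, 0.2], [1.0, 0.5], [1.0, 0.8]]  # [bias, price] features
    true_theta = [0.1, 1.0]                            # assumed ground truth
    policy = BatchedLinUCB(dim=2, alpha=0.5, batch_size=batch_size)
    choices = []
    for _ in range(rounds):
        i = policy.select(candidates)
        reward = sum(t * x for t, x in zip(true_theta, candidates[i]))
        policy.observe(candidates[i], reward)
        choices.append(i)
    return policy, choices
```

In this sketch the chosen price can change only once every `batch_size` rounds, which mirrors the stability motivation for batch pricing stated in the abstract; the model still uses every observation, but only at batch boundaries.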