To store or not: Online cost optimization for running big data jobs on the cloud

Xiankun Fu, Li Pan, Shijun Liu
DOI: https://doi.org/10.1016/j.future.2024.03.003
IF: 7.307
2024-03-03
Future Generation Computer Systems
Abstract: As businesses increasingly rely on cloud-based big data analytics services to drive insights, reducing the cost of storing and analyzing large volumes of data in the cloud has become a major concern. During the execution of big data analysis jobs, some of the generated data can be reused by subsequent jobs. By storing such intermediate data, businesses using cloud services can greatly reduce the cost of running big data jobs. An important challenge is determining which data should be stored in order to save costs. Existing storage strategies do not differentiate between data with different usage frequencies, resulting in significant storage costs in practical applications. To address this challenge, we propose two online algorithms, one deterministic and the other randomized, which dynamically decide whether to store the data with the aim of saving cost. We show that our proposed deterministic (resp., randomized) algorithm incurs a cost within a factor of 2 − α′ (resp., 2/(1 + α′)) of the minimum cost obtained by an optimal offline algorithm that is assumed to know the exact future a priori. Finally, through extensive experiments with real-world workloads of big data jobs in an Alibaba Cloud environment, we demonstrate that our proposed online algorithms can achieve significant cost savings under common cloud pricing schemes.
computer science, theory & methods