Optimal Task Replication Considering Reliability, Performance, and Energy Consumption for Parallel Computing in Cloud Systems

Xiwei Qiu, Peng Sun, Yuanshun Dai
DOI: https://doi.org/10.1016/j.ress.2021.107834
IF: 7.247
2021-01-01
Reliability Engineering & System Safety
Abstract: In a cloud-based cyber-physical system, many jobs consist of multiple parallel tasks. Cloud systems usually adopt active task replication to improve the performance and guarantee the reliability of a job. This technique creates redundant replicas of each task and executes them concurrently. Each replica is a virtual machine (VM) image that can easily be assigned to different physical machines (PMs) to overcome resource heterogeneity. However, designing a rational task replication strategy (including replica creation and VM assignment) is a complex problem, because the strategy must account for the correlations and tradeoffs among reliability, performance, and energy consumption. This paper first proposes a reliability-performance correlation model for a job executed with active task replication. We design a general method that avoids analyzing complex failure correlations and give a Bayesian approach to calculate the performability metric of the job. The paper also proposes a reliability-energy correlation model that uses mixed random variables to evaluate the random energy consumption of a PM hosting multiple VMs. Finally, an expected net profit optimization model and a genetic algorithm are developed to search for an optimal task replication strategy that balances the tradeoffs among reliability, performance, and energy consumption. Illustrative examples are provided.
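The tradeoff the abstract describes can be illustrated with a deliberately simplified sketch. It is not the paper's model: it assumes replica failures are independent (the paper instead handles failure correlations and uses a Bayesian approach), and it replaces the energy model with a flat, hypothetical `cost_per_replica`. The function names and the `revenue` parameter are likewise assumptions made for illustration only.

```python
from math import prod


def task_reliability(replica_reliabilities):
    """Probability that at least one replica of a task succeeds.

    Assumes replicas fail independently (a simplification).
    """
    return 1.0 - prod(1.0 - r for r in replica_reliabilities)


def job_reliability(tasks):
    """A job of parallel tasks succeeds only if every task succeeds."""
    return prod(task_reliability(replicas) for replicas in tasks)


def expected_net_profit(tasks, revenue, cost_per_replica):
    """Expected revenue from a successful job minus a flat cost per replica.

    The flat cost is a hypothetical stand-in for energy consumption.
    """
    replica_count = sum(len(replicas) for replicas in tasks)
    return revenue * job_reliability(tasks) - cost_per_replica * replica_count


# Two parallel tasks; each inner list holds per-replica success probabilities.
baseline = [[0.9, 0.9], [0.8]]
extra_replica = [[0.9, 0.9], [0.8, 0.8]]

print(job_reliability(baseline))       # about 0.792
print(job_reliability(extra_replica))  # about 0.9504
print(expected_net_profit(baseline, revenue=100, cost_per_replica=5))
print(expected_net_profit(extra_replica, revenue=100, cost_per_replica=5))
```

Under these toy numbers the extra replica raises job reliability enough to outweigh its cost; with a higher `cost_per_replica` the balance reverses, which is the kind of tradeoff an optimization over replication strategies has to resolve.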