Improving Failure Tolerance in Large-Scale Cloud Computing Systems
Liang Luo, Xiwei Qiu, Yuanshun Dai, Sa Meng
DOI: https://doi.org/10.1109/tr.2019.2901194
IF: 5.883
2019-06-01
IEEE Transactions on Reliability
Abstract: Large-scale cloud computing systems have served as the fundamental supporting platform for big data, Internet of Things, and artificial intelligence applications for the past decade. With the scale and complexity of these systems increasing dramatically, various hardware and software failures will inevitably occur and may not be detected and repaired in a timely manner. In addition, sophisticated architectural features of cloud computing may have an adverse impact on system reliability. In response to these challenges, this paper proposes a simulation-driven framework, based on operation logs from real cloud computing systems, for improving failure tolerance in large-scale cloud computing systems. For a given cloud computing system, we first conduct a systematic analysis of its structure and operation characteristics. A Markov-based model is used to examine the system's potential failures, assess their severities, and suggest quick recoveries. During this process, the proposed reliability-aware resource scheduling algorithm is adopted to optimize resources so that the system's reliability can be improved cost-effectively. We also report a case study to demonstrate the application of our algorithm in improving failure tolerance of a large-scale cloud computing system.
engineering, electrical & electronic, computer science, software engineering, hardware & architecture
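
The abstract refers to a Markov-based failure model and to reliability-aware resource scheduling without giving details. As a rough illustration only, and not the paper's actual model or algorithm, the sketch below uses the textbook two-state (up/down) continuous-time Markov chain, whose steady-state availability is repair_rate / (failure_rate + repair_rate), and then picks the smallest number of replicas that meets an availability target. The function names, the rates, and the assumption that replicas fail independently are all hypothetical choices made for this example.

```python
def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state probability of the 'up' state in a two-state Markov chain.

    failure_rate: transitions per hour from up to down (assumed constant).
    repair_rate:  transitions per hour from down to up (assumed constant).
    """
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("failure_rate must be >= 0 and repair_rate must be > 0")
    return repair_rate / (failure_rate + repair_rate)


def replicated_availability(availability: float, replicas: int) -> float:
    """Availability of `replicas` copies where one working copy is enough.

    Assumes the copies fail independently, which real cloud systems
    often violate (shared racks, power, or software faults).
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return 1.0 - (1.0 - availability) ** replicas


def min_replicas_for_target(availability: float, target: float,
                            max_replicas: int = 16) -> int:
    """Smallest replica count meeting `target`; a stand-in for a cost-aware choice."""
    for n in range(1, max_replicas + 1):
        if replicated_availability(availability, n) >= target:
            return n
    raise ValueError("target not reachable within max_replicas")


if __name__ == "__main__":
    # Hypothetical node: fails once per 1000 h on average, repaired in 10 h on average.
    a = steady_state_availability(failure_rate=0.001, repair_rate=0.1)
    print(f"single-node availability: {a:.4f}")
    print("replicas needed for 99.99%:", min_replicas_for_target(a, 0.9999))
```

With these made-up rates a single node is available roughly 99% of the time. A real analysis along the lines of the abstract would estimate the rates from operation logs and use a richer state space than up/down.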