Reliability Equations for Cloud Storage Systems with Proactive Fault Tolerance

Jing Li, Mingze Li, Gang Wang, Xiaoguang Liu, Zhongwei Li, Huijun Tang
DOI: https://doi.org/10.1109/tdsc.2018.2882512
2020-01-01
Abstract: As cloud storage systems increase in scale, hard drive failures are becoming more frequent, which raises reliability issues. In addition to traditional reactive fault tolerance, proactive fault tolerance is used to improve a system's reliability. However, few studies analyze the reliability of proactive cloud storage systems, and those that do typically assume an exponential distribution for drive failures. This paper presents closed-form equations for estimating the number of data-loss events, within a given time period, in proactive cloud storage systems using RAID-5, RAID-6, 2-way replication, and 3-way replication mechanisms. The equations model the impact of proactive fault tolerance, operational failures, failure restorations, latent block defects, and drive scrubbing on system reliability, and use time-based Weibull distributions to represent processes (instead of homogeneous Poisson processes). We also design a Monte-Carlo simulation method to simulate the running of proactive cloud storage systems. Using parameters obtained from the analysis of field data, the proposed equations closely match the time-consuming Monte-Carlo simulations. These equations allow designers to efficiently estimate system reliability under varying parameters, facilitating cloud storage system design.
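The contrast the abstract draws between a closed-form estimate and a Monte-Carlo simulation can be illustrated with a minimal sketch for a single Weibull failure process. This is not the paper's model or its equations: the function names, the scale and shape values, and the 10-year window below are all hypothetical, chosen only to show the idea. A shape of 1 reduces the Weibull distribution to the exponential case that earlier studies assume.

```python
import math
import random


def weibull_cdf(t, scale, shape):
    """Closed-form probability that a drive has failed by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def simulate_failure_fraction(t, scale, shape, n_drives, seed=0):
    """Monte-Carlo estimate: fraction of sampled drives that fail by time t."""
    rng = random.Random(seed)
    # random.weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    failed = sum(
        1 for _ in range(n_drives) if rng.weibullvariate(scale, shape) <= t
    )
    return failed / n_drives


# Hypothetical, illustrative parameters (not taken from the paper).
SCALE_HOURS = 500_000.0
SHAPE = 1.12
WINDOW_HOURS = 87_600.0  # roughly 10 years

closed_form = weibull_cdf(WINDOW_HOURS, SCALE_HOURS, SHAPE)
simulated = simulate_failure_fraction(WINDOW_HOURS, SCALE_HOURS, SHAPE, 200_000)

print(f"closed form: {closed_form:.4f}")
print(f"simulated:   {simulated:.4f}")
```

The closed-form value takes one evaluation, while the simulated value needs many samples to reach comparable precision, which is the efficiency argument the abstract makes for equations over simulation.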