Modeling SSD RAID reliability under general settings

Zhiyong Wu, Yongkun Li, Patrick P. C. Lee, Yinlong Xu
DOI: https://doi.org/10.1145/3203217.3203236
2018-01-01
Abstract: Solid-state drives (SSDs) are susceptible to the limited number of program/erase (P/E) cycles and uncorrectable flash errors, and hence achieving high reliability of SSD storage systems is a critical issue. RAID provides a viable option for enhancing system reliability by distributing redundancy across a number of SSDs. However, the flash error rate of an SSD increases with the number of P/E cycles, and this time-varying nature complicates the reliability analysis of SSD RAID. In addition, there remains very limited formal analysis that quantifies the reliability dynamics of an SSD RAID array under general settings. To this end, we propose a new continuous-time Markov chain (CTMC) model to characterize the reliability dynamics of SSD RAID over time under two general settings: (1) fault tolerance against a general number of device failures and (2) non-uniform workload. We validate the correctness of our CTMC model via trace-driven simulations. Based on our model, we further analyze the impact of different RAID parameters on the reliability dynamics of an SSD RAID array.
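To make the idea of a CTMC over device failures concrete, the following is a minimal illustrative sketch and not the model from the paper. It assumes a simple birth-death chain whose state is the number of failed SSDs, a data-loss state once more than k devices have failed, one rebuild at a time, and a per-device failure rate that grows with time as a stand-in for wear. The names `reliability`, `rate_fn`, `repair_rate`, and `wearing_rate`, and all numeric values, are hypothetical.

```python
def reliability(n, k, rate_fn, repair_rate, t_end, dt=0.01):
    """Probability that an n-device array tolerating k failures
    has not lost data by time t_end (illustrative sketch only).

    States 0..k count failed devices; state k+1 is absorbing (data loss).
    rate_fn(t) is an assumed per-device failure rate at time t.
    """
    p = [1.0] + [0.0] * (k + 1)          # start with all devices healthy
    steps = int(round(t_end / dt))
    for s in range(steps):
        lam = rate_fn(s * dt)            # time-varying failure rate
        dp = [0.0] * (k + 2)
        for i in range(k + 1):
            fail = (n - i) * lam * p[i]  # one more device fails
            dp[i] -= fail
            dp[i + 1] += fail
            if i > 0:
                rep = repair_rate * p[i] # one failed device is rebuilt
                dp[i] -= rep
                dp[i - 1] += rep
        # Forward-Euler step of the Kolmogorov forward equations
        p = [pi + dt * d for pi, d in zip(p, dp)]
    return 1.0 - p[k + 1]


def wearing_rate(t):
    # Hypothetical failure rate that increases linearly with age
    return 0.001 + 0.0005 * t


single_parity = reliability(n=8, k=1, rate_fn=wearing_rate, repair_rate=1.0, t_end=10.0)
double_parity = reliability(n=8, k=2, rate_fn=wearing_rate, repair_rate=1.0, t_end=10.0)
print(single_parity, double_parity)
```

Under these assumptions, tolerating more failures yields higher reliability at the same point in time, and reliability falls as the array ages. The forward-Euler step is chosen for brevity; it needs `dt` to be small relative to the largest transition rate.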