Abstract: With the rapid development of cloud computing and big data technologies, storage systems have become a fundamental building block of datacenters, incorporating hardware innovations such as flash solid-state drives and non-volatile memories, as well as software infrastructures such as RAID and distributed file systems. Despite the growing popularity of and interest in storage, designing and implementing reliable storage systems remains challenging because of their performance instability and prevalent hardware failures. Proactive prediction greatly strengthens the reliability of storage systems. There are two dimensions of prediction: performance and failure. Ideally, by detecting slow I/O requests in advance and predicting device failures before they occur, we can build storage systems with especially low tail latency and high availability. Although its importance is well recognized, such proactive prediction is particularly difficult in storage systems. To move towards predictability of storage systems, various mechanisms have been proposed and field studies conducted in the past few years. In this report, we present a survey of these mechanisms and field studies, focusing on machine-learning-based black-box approaches. Based on three representative research works, we discuss where and how machine learning should be applied in this field. The strengths and limitations of each research work are also evaluated in detail.

Tools for Predicting the Reliability of Large-Scale Storage Systems

Combining Model Checking and Testing with an Application to Reliability Prediction and Distribution

Reliability Analysis of Distributed Storage Systems Considering Data Loss and Theft

Dependability Analysis of a Cache-Based RAID System Via Fast Distributed Simulation

Reliability Assessment of Data Storage in Cyber Physical Systems

Towards Learned Predictability of Storage Systems

Approximate Reliability Evaluation of Large-Scale Multistate Series-Parallel Systems

Reliability Provision Mechanism for Large-Scale De-Duplication Storage Systems

A Reliability Model for Dependent and Distributed MDS Disk Array Units

Failure Analysis and Quantification for Contemporary and Future Supercomputers

Fiducial Approach for the Storage Reliability Assessment of Complex Repairable Systems

R-ADMAD: High Reliability Provision for Large-Scale De-Duplication Archival Storage Systems

An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers

Fuzzy Reliability Analysis of an iSCSI-Based Fault Tolerant Storage System Organization

Random Versus Copyset Placement: Data-Loss Models for Proactive-Tolerance Replica-Based Data Storage

Significance of Disk Failure Prediction in Datacenters

A Speculation-Based Approach for Performance and Dependability Analysis: A Case Study

Fuzzy Reliability Analysis of Distributed Storage System for Tolerating Double Node and Disk Failures

Stochastic Analysis on RAID Reliability for Solid-State Drives

The Life and Death of SSDs and HDDs: Similarities, Differences, and Prediction Models

Prediction of Future Failures for Heterogeneous Reliability Field Data