Tools for Predicting the Reliability of Large-Scale Storage Systems

Robert J. Hall
DOI: https://doi.org/10.1145/2911987
2016-08-29
ACM Transactions on Storage
Abstract: Data-intensive applications require extreme scaling of their underlying storage systems. Such scaling, together with the fact that storage systems must be implemented in actual data centers, increases the risk of data loss from failures of underlying components. Accurate engineering requires quantitatively predicting reliability, but this remains challenging due to the need to account for extreme scale, redundancy scheme type and strength, distribution architecture, and component dependencies. This article introduces CQSim-R, a tool suite for predicting the reliability of large-scale storage system designs and deployments. CQSim-R includes (a) direct calculations based on an only-drives-fail failure model and (b) an event-based simulator for detailed prediction that handles failures of and failure dependencies among arbitrary (drive or nondrive) components. These are based on a common combinatorial framework for modeling placement strategies. The article demonstrates CQSim-R using models of common storage systems, including replicated and erasure coded designs. New results, such as the poor reliability scaling of spread-placed systems and a quantification of the impact of data center distribution and rack-awareness on reliability, demonstrate the usefulness and generality of the tools. Analysis and empirical studies show the tools’ soundness, performance, and scalability.
computer science, software engineering, hardware & architecture