From Missteps to Milestones: A Journey to Practical Fail-Slow Detection

Ruiming Lu, Erci Xu, Yiming Zhang, Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, Jiesheng Wu
DOI: https://doi.org/10.1145/3617690
2023-01-01
ACM Transactions on Storage
Abstract: The newly emerging “fail-slow” failures plague both software and hardware: the victim components are still functioning, yet with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis of fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.