General Feature Selection for Failure Prediction in Large-scale SSD Deployment

Fan Xu, Shujie Han, Patrick P. C. Lee, Yi Liu, Cheng He, Jiongzhou Liu
DOI: https://doi.org/10.1109/DSN48987.2021.00039
2021-01-01
Abstract: Solid-state drive (SSD) failures are likely to cause system-level failures that lead to downtime, making SSD failure prediction critical to large-scale SSD deployment. Existing SSD failure prediction studies are mostly based on customized SSDs with proprietary monitoring metrics, which makes them difficult to reproduce. To support general SSD failure prediction across different drive models and vendors, this paper proposes Wearout-updating Ensemble Feature Ranking (WEFR), which selects SMART attributes as learning features in an automated and robust manner. WEFR combines different feature ranking results and automatically generates the final feature selection based on complexity measures and change point detection of wear-out degrees. We evaluate our approach using a dataset of nearly 500K working SSDs at Alibaba. Our results show that the proposed approach is effective and outperforms related approaches. We have successfully applied the proposed approach to improve the reliability of cloud storage systems in production SSD-based data centers. We release our dataset for public use.
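The abstract says WEFR combines several feature ranking results but does not say how. As a rough illustration of the general idea only, the sketch below aggregates rankings by mean rank, which is one common choice and is an assumption here, not the paper's procedure. The function names, the attribute names, and the scores are all hypothetical, and the sketch leaves out the complexity measures and the wear-out change point detection that WEFR uses to finalize the selection.

```python
from statistics import mean


def rank_features(scores):
    """Map each feature to its rank (1 = highest score) for one ranker."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def ensemble_ranking(score_tables):
    """Order features by mean rank across rankers, best first.

    Assumes every table scores the same set of features.
    Ties in mean rank are broken alphabetically by feature name.
    """
    rankings = [rank_features(table) for table in score_tables]
    features = list(rankings[0])
    mean_rank = {f: mean(r[f] for r in rankings) for f in features}
    return sorted(features, key=lambda f: (mean_rank[f], f))


# Hypothetical importance scores from three different rankers.
ranker_a = {"wear_leveling_count": 0.9, "reallocated_sectors": 0.6, "power_on_hours": 0.2}
ranker_b = {"wear_leveling_count": 0.5, "reallocated_sectors": 0.7, "power_on_hours": 0.1}
ranker_c = {"wear_leveling_count": 0.8, "reallocated_sectors": 0.3, "power_on_hours": 0.4}

combined = ensemble_ranking([ranker_a, ranker_b, ranker_c])
print(combined)
```

Working on ranks rather than raw scores keeps rankers with different score scales comparable, which is one reason rank aggregation is a typical starting point for ensemble feature selection.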