Abstract: Hard Disk Drive (HDD) failures in datacenters are costly: their consequences range from catastrophic data loss to lost goodwill, and stakeholders want to avoid them like the plague. An important tool for proactively guarding against HDD failure is timely estimation of the Remaining Useful Life (RUL). To this end, the Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) employed within HDDs provides critical logs for long-term maintenance of the security and dependability of these essential data storage devices. Past data-driven predictive models have relied heavily on these S.M.A.R.T. logs and on CNN/RNN-based architectures. However, they have struggled to provide a confidence interval around the predicted RUL values and to process very long sequences of logs. In addition, some of these approaches, such as those based on LSTMs, are inherently slow to train and carry tedious feature-engineering overheads. To overcome these challenges, we propose a novel transformer architecture, the Temporal-fusion Bi-encoder Self-attention Transformer (TFBEST), for predicting failures in hard drives. It is an encoder-decoder based deep learning technique that enhances the context gained from sequences of health statistics and predicts a sequence of the number of days remaining before a disk potentially fails. We also provide a novel confidence margin statistic that can help manufacturers replace a hard drive within a time frame. Experiments on Seagate HDD data show that our method significantly outperforms state-of-the-art RUL prediction methods when tested on the exhaustive 10-year data from Backblaze (2013-present). Although validated on HDD failure prediction, the TFBEST architecture is well suited to other prognostics applications and may be adapted for allied regression problems.

Improving the accuracy, adaptability, and interpretability of SSD failure prediction models

General Feature Selection for Failure Prediction in Large-scale SSD Deployment

A Failure Prediction Approach Based on BiLSTM and Deep Feature Extractor for Hard Disk Drives

An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers

The Life and Death of SSDs and HDDs: Similarities, Differences, and Prediction Models

Temporal-Contextual Attention Network for Solid-State Drive Failure Prediction in Data Centers

Modeling 3D NAND Flash with Nonparametric Inference on Regression Coefficients for Reliable Solid-State Storage

Health Status Assessment and Failure Prediction for Hard Drives with Recurrent Neural Networks

A Machine-Learning-based Data Classifier to Reduce the Write Amplification in SSDs

Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives

Hard drive failure prediction using Decision Trees

Remaining Useful Life Estimation of Hard Disk Drives using Bidirectional LSTM Networks

TFBEST: Dual-Aspect Transformer with Learnable Positional Encoding for Failure Prediction

NVMe and PCIe SSD Monitoring in Hyperscale Data Centers

Boosting Correlated Failure Repair in SSD Data Centers

RecSSD: near data processing for solid state drive based recommendation inference

Toward Adaptive Disk Failure Prediction Via Stream Mining

Towards Learned Predictability of Storage Systems

A locally weighted multi-domain collaborative adaptation for failure prediction in SSDs

Classification Based Hard Disk Drive Failure Prediction: Methodologies, Performance Evaluation and Comparison

Layerwise Perturbation-Based Adversarial Training for Hard Drive Health Degree Prediction