Abstract: Physical and cloud storage services depend on reliable high-volume storage systems. Recent observations point to hard disk reliability as one of the most pressing issues in data centers that house massive numbers of storage devices such as HDDs. Early detection of impending failure at the disk level reduces system downtime and operational loss, which makes proactive health monitoring a priority for AIOps in such settings. In this work, we introduce methods for extracting meaningful attributes associated with operational failure and for pre-processing the highly imbalanced health statistics data for subsequent prediction tasks using data-driven approaches. We use a Bidirectional LSTM with a multi-day look-back period to learn the temporal progression of health indicators, and we benchmark it against vanilla LSTM and Random Forest models on several key metrics that establish the usefulness and superiority of our model under some tightly defined operational constraints. For example, using a 15-day look-back period, our approach predicts the occurrence of disk failure with an accuracy of 96.4% on test data covering the 60 days before failure. This alerts operations and maintenance teams well in advance about potential mitigation needs. In addition, our model reports a mean absolute error of 0.12 for predicting failure up to 60 days in advance, placing it among the state of the art in recent literature.
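To make the multi-day look-back idea concrete, the following is a minimal sketch of how one disk's daily health records might be sliced into fixed-length windows paired with a days-to-failure target. It is an illustration under stated assumptions, not the pipeline used in this work: the function name `make_lookback_windows`, the convention that the last row is the failure day, and the capping of the target at the 60-day horizon are all hypothetical choices.

```python
from typing import List, Sequence, Tuple


def make_lookback_windows(
    daily_features: Sequence[Sequence[float]],
    lookback: int = 15,
    horizon: int = 60,
) -> Tuple[List[List[List[float]]], List[int]]:
    """Slice one failed disk's daily records into look-back windows.

    Assumptions (not taken from the paper):
      * rows are ordered oldest to newest, one row per day;
      * the last row corresponds to the day of failure;
      * the target is the number of days until failure, capped at `horizon`.
    """
    n_days = len(daily_features)
    windows: List[List[List[float]]] = []
    targets: List[int] = []

    # `end` is the exclusive end index of each window.
    for end in range(lookback, n_days + 1):
        window = [list(row) for row in daily_features[end - lookback:end]]
        days_to_failure = n_days - end  # 0 for the window ending on the failure day
        windows.append(window)
        targets.append(min(days_to_failure, horizon))

    return windows, targets


if __name__ == "__main__":
    # Toy series: 20 days, 2 made-up health indicators per day.
    series = [[float(day), float(day) * 2.0] for day in range(20)]
    windows, targets = make_lookback_windows(series, lookback=15, horizon=60)
    print(len(windows), targets)
```

Each window has shape (look-back days, number of indicators), which is the form a sequence model such as an LSTM typically consumes. Healthy disks would need a separate labelling convention, which this sketch does not cover.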
