McUDI: Model-Centric Unsupervised Degradation Indicator for Failure Prediction AIOps Solutions

Lorena Poenaru-Olaru, Luis Cruz, Jan Rellermeyer, Arie van Deursen
2024-01-25
Abstract: Due to the continuous change in operational data, AIOps solutions suffer from performance degradation over time. Although periodic retraining is the state-of-the-art technique to preserve the performance of failure prediction AIOps models over time, this technique requires a considerable amount of labeled data to retrain. In AIOps, obtaining labeled data is expensive since it requires domain experts to annotate it intensively. In this paper, we present McUDI, a model-centric unsupervised degradation indicator that is capable of detecting the exact moment the AIOps model requires retraining as a result of changes in data. We further show how employing McUDI in the maintenance pipeline of AIOps solutions can reduce the number of samples that require annotation by 30k for job failure prediction and 260k for disk failure prediction while achieving performance similar to periodic retraining.
Software Engineering, Machine Learning
What problem does this paper attempt to address?
This paper addresses the performance degradation that machine learning models in AIOps (Artificial Intelligence for IT Operations) solutions suffer when operational data changes over time. The existing approach is to retrain the models periodically, which requires a large amount of annotated data that is expensive and time-consuming to obtain in the AIOps domain. The paper proposes McUDI (Model-Centric Unsupervised Degradation Indicator), which detects when a model needs to be retrained because of data changes and thereby reduces the need for annotated data. McUDI ranks the model's features by importance, selects the features with higher average importance, and applies the Kolmogorov-Smirnov statistical test to the data distributions of these important features to identify data drift.

The paper shows that incorporating McUDI into the maintenance process of AIOps solutions can significantly reduce the number of annotated samples required while maintaining performance similar to periodic retraining. It also evaluates current data monitoring practices on failure prediction data and compares McUDI with other methods, reporting that McUDI reduces the need for retraining while preserving model performance. Finally, the study proposes a McUDI-based maintenance pipeline that is more cost-effective in terms of label acquisition. In short, the paper targets the degradation of AIOps models caused by data drift and proposes an unsupervised indicator, McUDI, that makes model maintenance more efficient and less costly.
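The feature-selection and drift-test steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: "higher average importance" is read as "importance above the mean importance", the significance level `alpha` is set to 0.05, and every function name is hypothetical. The Kolmogorov-Smirnov p-value uses the standard asymptotic approximation. In practice a library routine such as `scipy.stats.ks_2samp` would normally replace the hand-written test.

```python
import math


def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic with an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between the ECDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The series below converges slowly here; the true tail value is ~1.
        return d, 1.0
    # Tail of the Kolmogorov distribution (alternating series).
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))


def select_important_features(importances):
    """Rank features by importance and keep those above the mean importance.

    Assumption: this threshold is one reading of 'higher average importance'.
    """
    mean_importance = sum(importances.values()) / len(importances)
    ranked = sorted(importances, key=importances.get, reverse=True)
    return [f for f in ranked if importances[f] > mean_importance]


def drifted_features(reference, current, importances, alpha=0.05):
    """Return the important features whose distribution changed.

    `reference` and `current` map feature names to lists of unlabeled values;
    `alpha` is an assumed significance level, not a value from the paper.
    """
    drifted = []
    for feature in select_important_features(importances):
        _, p_value = ks_two_sample(reference[feature], current[feature])
        if p_value < alpha:
            drifted.append(feature)
    return drifted
```

Under these assumptions, a maintenance pipeline would call `drifted_features` on each new batch of unlabeled operational data and request labels and retraining only when the returned list is non-empty. How drift on individual features is aggregated into a single retraining decision is a design choice that this sketch leaves open.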