Abstract: AIOps (Artificial Intelligence for IT Operations) solutions leverage the massive data produced during the operation of large-scale systems and machine learning models to assist software engineers in their system operations. As operation data produced in the field are constantly evolving due to factors such as the changing operational environment and user base, the models in AIOps solutions need to be constantly maintained after deployment. While prior works focus on innovative modeling techniques to improve the performance of AIOps models before releasing them into the field, when and how to update AIOps models remain an under-investigated topic. In this work, we performed a case study on three large-scale public operation datasets and empirically assessed five different types of model update strategies for supervised learning regarding their performance, updating cost, and stability. We observed that active model update strategies (e.g., periodical retraining, concept drift guided retraining, time-based model ensembles, and online learning) achieve better and more stable performance than a stationary model. Particularly, applying sophisticated model update strategies could provide better performance, efficiency, and stability than simply retraining AIOps models periodically. In addition, we observed that, although some update strategies can save model training time, they significantly sacrifice model testing time, which could hinder their applications in AIOps solutions where the operation data arrive at high pace and volume and where immediate inferences are required. Our findings highlight that practitioners should consider the evolution of operation data and actively maintain AIOps models over time. Our observations can also guide researchers and practitioners in investigating more efficient and effective model update strategies that fit in the context of AIOps.

What problem does this paper attempt to address?

This paper studies how to update supervised learning models in AIOps (Artificial Intelligence for IT Operations) solutions. AIOps uses large-scale system operation data and machine learning models to assist software engineers in operating their systems. Because operational data evolve continuously as the environment and the user base change, deployed AIOps models require ongoing maintenance. Previous work has concentrated on improving the performance of AIOps models before deployment, while the question of when and how to update these models has not been sufficiently investigated.

The paper conducts a case study on three large public operational datasets: trace datasets from the Google and Alibaba cloud platforms, and a disk statistics dataset from the Backblaze cloud storage data center. The authors evaluate five model update strategies (a stationary model, periodic retraining, concept-drift-guided retraining, time-based model ensembles, and online learning) with respect to performance, update cost, and stability.

The study finds that the active update strategies (periodic retraining, concept-drift-guided retraining, time-based model ensembles, and online learning) outperform a stationary model and are more stable. In particular, the more sophisticated strategies (concept-drift-guided retraining, time-based ensembles, and online learning) may provide better performance, efficiency, and stability than simple periodic retraining. However, some strategies (such as time-based ensembles and online learning) save training time at the price of significantly longer testing time, which could limit their use in AIOps solutions where data arrive at high pace and volume and immediate inference is required. The paper concludes that practitioners should account for the evolution of operational data and actively maintain AIOps models over time, and it offers guidance to researchers and practitioners looking for more efficient and effective update strategies suited to AIOps.
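To make the difference between the strategies concrete, the following is a minimal Python sketch that compares a stationary model, periodic retraining, and a drift-guided variant on a toy data stream. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data generator `make_period`, the one-dimensional threshold "model", the window size, and the `min_accuracy` trigger, which is only a crude stand-in for a real concept drift detector.

```python
from statistics import mean

def make_period(p, drift=5):
    """Hypothetical toy data for time period p: (feature, label) pairs
    whose feature range shifts by `drift` every period."""
    shift = drift * p
    return [(shift + i, 0) for i in range(5)] + \
           [(shift + 10 + i, 1) for i in range(5)]

def train(samples):
    """Fit a 1-D threshold halfway between the two class means."""
    m0 = mean(x for x, y in samples if y == 0)
    m1 = mean(x for x, y in samples if y == 1)
    return (m0 + m1) / 2

def accuracy(threshold, samples):
    """Fraction of samples for which 'x > threshold' matches the label."""
    hits = sum((x > threshold) == bool(y) for x, y in samples)
    return hits / len(samples)

def stationary(periods):
    """Train once on the first period and never update."""
    model = train(periods[0])
    return [accuracy(model, p) for p in periods[1:]]

def periodic_retraining(periods, window=1):
    """Before each period, retrain on the most recent `window` periods."""
    scores = []
    for t in range(1, len(periods)):
        recent = [s for p in periods[max(0, t - window):t] for s in p]
        scores.append(accuracy(train(recent), periods[t]))
    return scores

def drift_guided(periods, min_accuracy=0.7):
    """Retrain only when accuracy on the latest period falls below
    `min_accuracy` (a simplified drift signal, assumed for this sketch)."""
    model, scores, retrains = train(periods[0]), [], 0
    for t in range(1, len(periods)):
        acc = accuracy(model, periods[t])
        scores.append(acc)
        if acc < min_accuracy:
            model = train(periods[t])  # labels of period t are now known
            retrains += 1
    return scores, retrains

if __name__ == "__main__":
    periods = [make_period(p) for p in range(4)]
    print("stationary:  ", stationary(periods))
    print("periodic:    ", periodic_retraining(periods))
    print("drift-guided:", drift_guided(periods))
```

On this toy stream the stationary model degrades as the data shift, periodic retraining keeps up at the cost of one training run per period, and the drift-guided variant retrains less often but recovers only after a drop has been observed. These trade-offs are properties of the sketch, not results from the paper.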

On the Model Update Strategies for Supervised Learning in AIOps Solutions

A novel lifelong machine learning-based method to eliminate calibration drift in clinical prediction models.

Online dual updating with recursive PLS model and its application in predicting crystal size of purified terephthalic acid (PTA) process

Is Your Anomaly Detector Ready for Change? Adapting AIOps Solutions to the Real World

Quality Monitoring and Assessment of Deployed Deep Learning Models for Network AIOps

Component Modeling and Updating Method of Integrated Energy Systems Based on Knowledge Distillation

Towards a consistent interpretation of AIOps models

Dynamic Load Change Operation Education in Air Separation Processes Using a Multivariable and Nonlinear Model

Learning to Diagnose: Meta-Learning for Efficient Adaptation in Few-Shot AIOps Scenarios

AIOps in Action: Automating AI Deployment and Management of Large Language Models for Scalable and Ethical Operations

A Survey of AIOps for Failure Management in the Era of Large Language Models

Towards Automating the AI Operations Lifecycle

An Efficient Model Maintenance Approach for MLOps

How do I update my model? On the resilience of Predictive Process Monitoring models to change

A Roadmap to Success: Strategies and Challenges in Adopting AIOps for IT Operations

McUDI: Model-Centric Unsupervised Degradation Indicator for Failure Prediction AIOps Solutions

Research of Artificial Intelligence Operations for Wind Turbines Considering Anomaly Detection, Root Cause Analysis, and Incremental Training

Adaptive Nonlinear Model Predictive Control Using an On-line Support Vector Regression Updating Strategy

Automating the Training and Deployment of Models in MLOps by Integrating Systems with Machine Learning

A Survey of AIOps Methods for Failure Management

On-Premise AIOps Infrastructure for a Software Editor SME: An Experience Report