AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review

Youcef Remil, Anes Bendimerad, Romain Mathonat, Mehdi Kaytoue
2024-04-02
Abstract: The management of modern IT systems poses unique challenges, necessitating scalability, reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on manual tasks and rule-based approaches, prove inefficient for the substantial data volumes and alerts generated by IT systems. Artificial Intelligence for IT Operations (AIOps) has emerged as a solution, leveraging advanced analytics like machine learning and big data to enhance incident management. AIOps detects and predicts incidents, identifies root causes, and automates healing actions, improving quality and reducing operational costs. However, despite its potential, the AIOps domain is still in its early stages, decentralized across multiple sectors, and lacking standardized conventions. Research and industrial contributions are distributed without consistent frameworks for data management, target problems, implementation details, requirements, and capabilities. This study proposes an AIOps terminology and taxonomy, establishing a structured incident management procedure and providing guidelines for constructing an AIOps framework. The research also categorizes contributions based on criteria such as incident management tasks, application areas, data sources, and technical approaches. The goal is to provide a comprehensive review of technical and research aspects in AIOps for incident management, aiming to structure knowledge, identify gaps, and establish a foundation for future developments in the field.
Operating Systems, Artificial Intelligence, Software Engineering
What problem does this paper attempt to address?
This paper addresses incident management in modern, complex IT systems through Artificial Intelligence for IT Operations (AIOps). Traditional manual and rule-based approaches cannot keep up with the volume of data and alerts these systems generate. AIOps applies machine learning and big data technologies to incident detection, prediction, root cause analysis, and automated remediation, with the aim of improving service quality and reducing operational costs. The field is still in its early stages, however: it lacks standardized conventions, and research and industrial contributions are scattered, with no consistent frameworks for data management, target problems, implementation details, requirements, and capabilities. The paper therefore proposes an AIOps terminology and taxonomy, a structured incident management procedure, and guidelines for building an AIOps framework. It also categorizes existing contributions by incident management task, application area, data source, and technical approach, in order to structure knowledge, identify research gaps, and lay a foundation for future work. Throughout, the focus is on applying AIOps to incident management, including requirements for data collection, storage, and visualization, task definition, model construction, and evaluation metrics.