A Survey on Intelligent Management of Alerts and Incidents in IT Services
Qingyang Yu, Nengwen Zhao, Mingjie Li, Zeyan Li, Honglin Wang, Wenchi Zhang, Kaixin Sui, Dan Pei
DOI: https://doi.org/10.1016/j.jnca.2024.103842
IF: 7.574
2024-01-01
Journal of Network and Computer Applications
Abstract: Modern service systems are constantly improving with the development of various IT technologies, leading to larger system scales and more complex dependencies among service components. This scale and complexity make services more prone to failure. To keep services running normally and stably, alert and incident management (AIM), which analyzes and handles service failures in a timely manner, has become an important part of IT service management (ITSM). Many intelligent solutions have been proposed to improve the management process. However, there is currently no comprehensive survey that systematically reviews the related work. Moreover, no integrated AIM architecture covers every detailed process or accommodates most of the existing piecemeal solutions. We therefore conduct an in-depth survey to address these problems. To the best of our knowledge, this paper is the most comprehensive survey on intelligent AIM in IT services. The survey makes three contributions. First, we summarize an integrated architecture that includes the detailed AIM processes and key techniques. Second, we provide a systematic review of related work based on that architecture. Third, we analyze current challenges and trends in AIM.