Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems

Yiru Chen, Chenxi Zhang, Zhen Dong, Dingyu Yang, Xin Peng, Jiayu Ou, Hong Yang, Zheshun Wu, Xiaojun Qu, Wei Li
DOI: https://doi.org/10.1109/ase56229.2023.00177
2024-01-01
Abstract: A fault in a large online service system often triggers numerous alerts because of the complex business and component dependencies among services, a phenomenon known as an “alert storm”. In a short time, an online service system may generate a huge amount of alert data. This makes it challenging for on-call engineers to identify the alerts associated with a system failure for root cause analysis. In this paper, we propose DyAlert, a dynamic graph neural network-based approach for linking alerts that might be triggered by the same fault, to reduce the burden on on-call engineers during fault analysis. Our insight is that alerts are often triggered by alert propagation when a system failure occurs, e.g., alert $a$ would lead to the occurrence of alert $b$. Whether two alerts should be linked depends on whether one alert is triggered by the propagation of the other. Leveraging this insight, we design a dynamic graph (namely, the Alert-Metric Dynamic Graph) that describes the propagation process of alerts. Based on the dynamic graph, we train a neural network-based model to predict alert links. We evaluate DyAlert with real-world data collected from an online service system running 85 business units and about 30,000 different services in a large enterprise. The results show that DyAlert is effective in predicting alert links and outperforms the state-of-the-art approaches with an average increase of 0.259 in F1-score.