Root Cause Analysis of Concurrent Alarms Based on Random Walk over Anomaly Propagation Graph.

Lingyu Zhang, Jiabao Zhao, Min Zhang
DOI: https://doi.org/10.1109/icnsc48988.2020.9238084
2020-01-01
Abstract: As Internet technology develops, IT systems are becoming increasingly complex. Their components are linked by two main kinds of relationship: service call relationships and deployment configuration relationships. Once a local anomaly occurs in the system, it tends to spread, triggering sudden, dense bursts of concurrent alarms. It is therefore important to locate the root cause of concurrent alarms quickly and precisely. In this paper, we first construct an anomaly propagation graph from collected system data. Based on this graph, we then propose two alternative algorithms, random walk and state iteration, to trace the anomaly propagation process and locate the root cause. Simulation experiments demonstrate that the proposed method localizes root causes correctly and rapidly in scenarios with complex call chains and resource competition, and that it is robust to alarm errors. The proposed method pays more attention to system characteristics and depends little on the empirical knowledge of IT operators.
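To make the random-walk idea concrete, the following is a minimal sketch and not the authors' algorithm, whose details the abstract does not give. It assumes a generic random walk with restart, computed by power iteration, over a toy dependency graph. The function name `rank_root_causes`, the edge format, the parameters `restart` and `eps`, and the example components are all hypothetical.

```python
def rank_root_causes(edges, alarmed, steps=100, restart=0.15, eps=0.01):
    """Rank nodes by visit probability of a random walk with restart.

    edges:   dict mapping a component to the components it depends on
             (hypothetical format; the walker moves toward dependencies).
    alarmed: set of components currently raising alarms.
    """
    nodes = sorted(set(edges) | {d for deps in edges.values() for d in deps})
    # The walker restarts uniformly on alarmed components.
    start = {n: (1.0 / len(alarmed) if n in alarmed else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(steps):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            # Prefer alarmed dependencies; eps keeps silent ones reachable.
            weights = {d: (1.0 if d in alarmed else eps) for d in edges.get(n, [])}
            # Self-loop, so an alarmed node with no alarmed dependency keeps mass.
            weights[n] = weights.get(n, 0.0) + (1.0 if n in alarmed else eps)
            total = sum(weights.values())
            for d, w in weights.items():
                nxt[d] += (1 - restart) * prob[n] * w / total
        prob = nxt
    return sorted(prob.items(), key=lambda kv: kv[1], reverse=True)


# Toy example: frontend calls api and cache; api calls db.
edges = {"frontend": ["api", "cache"], "api": ["db"], "db": [], "cache": []}
alarmed = {"frontend", "api", "db"}
ranking = rank_root_causes(edges, alarmed)
for node, score in ranking:
    print(f"{node}: {score:.3f}")
```

In this toy setup the walk drifts along alarmed dependencies, so probability mass accumulates on the deepest alarmed component, `db`, which is ranked first. A real anomaly propagation graph would also need edges for deployment relationships and weights derived from measured data, which this sketch does not attempt.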