Locating Anomaly Clues for Atypical Anomalous Services: An Industrial Exploration
Guoping Rong, Hao Wang, Shenghui Gu, Yangchen Xu, Jialin Sun, Dong Shao, He Zhang
DOI: https://doi.org/10.1109/tdsc.2022.3181143
2022-01-01
IEEE Transactions on Dependable and Secure Computing
Abstract: Continuity and steadiness are vital for services with massive numbers of users, which requires that service anomalies be detected and resolved in a timely manner. Our previous work proposed a tool, ImpAPTr (Impact Analysis based on Pruning Tree), to identify combinations of multi-dimensional attributes as clues leading to the root cause of service anomalies. However, ImpAPTr applies a threshold-driven strategy, i.e. it must be triggered by a ≥ 0.05% drop in the success rate of service calls (abbr. SRSC), which can cause problems in an atypical yet pervasive situation in field application. For example, a combination of trivial anomalies (i.e. each causing a drop of less than 0.05% in SRSC) can lead to a drop of far more than 0.05% in SRSC. In addition, a suitable threshold is usually hard to determine. To address these problems, we propose a new method, ImpAPTr+, which removes the constraint of the 0.05% threshold. The basic idea is to involve the time dimension and identify clues across multiple time intervals of data. We evaluated three typical methods (i.e. ImpAPTr+, R-Adtributor and Squeeze) on both a production environment dataset and a simulation dataset. The former is retrieved directly from service monitoring data at Meituan, one of the largest online service providers worldwide. The latter is fabricated, also using monitoring data from the same company. The results indicate: (1) ImpAPTr+ outperforms previous approaches to a large degree in terms of accuracy. (2) Both ImpAPTr+ and R-Adtributor are able to find proper clues within seconds. (3) ImpAPTr+ tends to find proper clues with shorter time intervals (i.e. less data), which implies that the method is more suitable for near real-time monitoring scenarios.
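The abstract's example of trivial anomalies combining into a large SRSC drop can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the call counts, the three anomalies, and the helper name `srsc` are hypothetical, and the 0.05% threshold is assumed to be measured in percentage points of SRSC.

```python
# Illustrative arithmetic only; all counts below are hypothetical.
THRESHOLD = 0.05  # assumed to mean percentage points of SRSC


def srsc(successes: int, total: int) -> float:
    """Success rate of service calls, in percent."""
    return 100.0 * successes / total


TOTAL_CALLS = 1_000_000
BASELINE_SUCCESSES = 999_000  # hypothetical healthy interval: 99.9% SRSC
baseline = srsc(BASELINE_SUCCESSES, TOTAL_CALLS)

# Three hypothetical anomalies, each adding 400 failed calls.
extra_failures = [400, 400, 400]

# Drop caused by each anomaly taken alone.
individual_drops = [
    baseline - srsc(BASELINE_SUCCESSES - f, TOTAL_CALLS) for f in extra_failures
]

# Drop when all three occur in the same interval.
combined_drop = baseline - srsc(
    BASELINE_SUCCESSES - sum(extra_failures), TOTAL_CALLS
)

for i, drop in enumerate(individual_drops, start=1):
    print(f"anomaly {i}: drop {drop:.3f} points, below threshold: {drop < THRESHOLD}")
print(f"combined: drop {combined_drop:.3f} points, above threshold: {combined_drop > THRESHOLD}")
```

Under these assumed numbers, no single anomaly crosses the threshold on its own, yet the service as a whole degrades by more than twice the threshold. That is the situation the abstract describes as problematic for a strategy driven purely by a per-clue threshold.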
computer science, information systems, software engineering, hardware & architecture