Locating the Clues of Declining Success Rate of Service Calls
Guoping Rong, Hao Wang, Yong You, He Zhang, Jialin Sun, Dong Shao, Yangchen Xu
DOI: https://doi.org/10.1109/issre5003.2020.00039
2020-10-01
Abstract: For online systems with massive user bases, providing services continuously and steadily is vital to the business, which requires service anomalies to be located and resolved in a timely manner. As a common piece of IT infrastructure, various APM (Application Performance Management) systems and frameworks have been adopted to monitor each call request to a service. However, a call request may carry multidimensional attributes (e.g., City, ISP, Platform), each of which may take multiple values (e.g., ISP could be T-Mobile, CMCC, etc.). As a result, an anomaly such as a DSR (Declining Success Rate) of a service typically occurs under a particular combination of attribute values, and the potentially huge number of such combinations makes locating the root cause of the anomaly a major challenge. In this paper, we propose a novel method, ImpAPTr (Impact Analysis based on Pruning Tree), to identify, in a timely manner, the combinations of dimensional attributes that serve as clues leading to the root cause of DSR anomalies. In an evaluation on a simulated dataset, ImpAPTr detects valid clues in milliseconds with an accuracy of 99.37% (within the top 10 candidate results), 97.72% (top 5), and 94.51% (top 3), outperforming previous approaches by a large margin. A field test on a dataset from a production environment indicates that ImpAPTr is able to detect valid clues within a few seconds.
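The search-space problem described in the abstract can be made concrete with a small sketch. The Python below is not ImpAPTr; it is a naive, exhaustive illustration, assuming hypothetical call records with three attributes, of how success rates can be computed per attribute-value combination and how the number of dimension subsets to inspect grows. The record values, function names, and data layout are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical call records: (attributes, succeeded). Values are illustrative only.
calls = [
    ({"City": "CityA", "ISP": "CMCC", "Platform": "iOS"}, True),
    ({"City": "CityA", "ISP": "CMCC", "Platform": "Android"}, False),
    ({"City": "CityA", "ISP": "T-Mobile", "Platform": "iOS"}, True),
    ({"City": "CityB", "ISP": "CMCC", "Platform": "Android"}, False),
    ({"City": "CityB", "ISP": "T-Mobile", "Platform": "Android"}, True),
    ({"City": "CityB", "ISP": "T-Mobile", "Platform": "iOS"}, True),
]


def success_rates(records, dims):
    """Return the success rate of every observed value combination of `dims`."""
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, total calls]
    for attrs, ok in records:
        key = tuple((d, attrs[d]) for d in dims)
        totals[key][0] += int(ok)
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


def dimension_subsets(dims):
    """Yield every non-empty subset of dimensions (2**len(dims) - 1 in total)."""
    for r in range(1, len(dims) + 1):
        yield from combinations(dims, r)


if __name__ == "__main__":
    # Exhaustively scan all dimension subsets and print each combination's rate.
    for dims in dimension_subsets(("City", "ISP", "Platform")):
        for combo, rate in sorted(success_rates(calls, dims).items()):
            print(combo, f"{rate:.2f}")
```

In this toy data, every failure falls under the combination ISP = CMCC and Platform = Android, which is the kind of clue the abstract refers to. With many dimensions and many values per dimension, the exhaustive scan above quickly becomes impractical, which is the difficulty a pruning-based approach is meant to address.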