Unsupervised Detection of Microservice Trace Anomalies Through Service-Level Deep Bayesian Networks

Ping Liu, Haowen Xu, Qianyu Ouyang, Rui Jiao, Zhekang Chen, Shenglin Zhang, Jiahai Yang, Linlin Mo, Jice Zeng, Wenman Xue, Dan Pei
DOI: https://doi.org/10.1109/issre5003.2020.00014
2020-01-01
Abstract: Anomalies in microservice invocation traces (traces) often indicate that the quality of a large microservice-based software service is being impaired. However, detecting trace anomalies accurately and in a timely manner is very challenging because of 1) the large number of underlying microservices, 2) the complex call relationships between them, and 3) the interdependency between response times and invocation paths. Our core idea is to use machine learning to automatically learn the overall normal patterns of traces during periodic offline training. In online anomaly detection, a new trace with a small anomaly score (computed based on the learned normal pattern) is considered anomalous. With our novel trace representation and the design of deep Bayesian networks with posterior flow, our unsupervised anomaly detection system, called TraceAnomaly, can accurately and robustly detect trace anomalies in a unified fashion. TraceAnomaly has been deployed on 18 online services in a company S. Detailed evaluations on four large online services, which contain hundreds of microservices, and on a testbed, which contains 41 microservices, show that the recall and precision of TraceAnomaly are both above 0.97, outperforming the existing approach in S (hard-coded rules) by 19.6% and 7.1%, respectively, and seven other baselines by 57.0% and 41.6% on average.
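The offline-training and online-scoring workflow described in the abstract can be illustrated with a minimal sketch. It does not reproduce TraceAnomaly: the deep Bayesian network with posterior flow is replaced here by independent per-dimension Gaussians, and the sketch assumes each trace has already been encoded as a fixed-length vector with one response time per invocation path. The function names, the sample data, and the threshold margin are all hypothetical. The score is a log-likelihood, so, as in the abstract, a small score marks a trace as anomalous.

```python
import math
from statistics import mean, pstdev


def fit_normal_pattern(train_vectors):
    """Offline step: estimate a per-dimension mean and std from normal traces.

    Stand-in for the paper's deep Bayesian network (assumption of this sketch).
    """
    dims = list(zip(*train_vectors))
    means = [mean(d) for d in dims]
    # Floor the std so a constant dimension does not cause division by zero.
    stds = [max(pstdev(d), 1e-6) for d in dims]
    return means, stds


def score(vector, model):
    """Log-likelihood of a trace vector under the fitted Gaussians.

    Higher means closer to the learned normal pattern.
    """
    means, stds = model
    total = 0.0
    for x, mu, sigma in zip(vector, means, stds):
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (x - mu) ** 2 / (2 * sigma ** 2)
    return total


def is_anomalous(vector, model, threshold):
    """Online step: a trace whose score falls below the threshold is flagged."""
    return score(vector, model) < threshold


# Hypothetical response times (ms) for three invocation paths.
train = [
    [10.0, 20.0, 5.0],
    [11.0, 19.0, 5.5],
    [9.5, 21.0, 4.8],
    [10.5, 20.5, 5.2],
]
model = fit_normal_pattern(train)

# Hypothetical threshold: slightly below the worst score seen in training.
threshold = min(score(v, model) for v in train) - 1.0

normal_trace = [10.2, 20.1, 5.1]
slow_trace = [10.2, 95.0, 5.1]  # second path is far slower than usual

print(is_anomalous(normal_trace, model, threshold))
print(is_anomalous(slow_trace, model, threshold))
```

This sketch covers only response-time deviations on a fixed set of paths. Handling traces whose invocation paths differ from those seen in training, which the abstract identifies as part of the problem, would need a representation for absent or unseen paths.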