Root-Cause Metric Location for Microservice Systems Via Log Anomaly Detection

Lingzhi Wang, Nengwen Zhao, Junjie Chen, Pinnong Li, Wenchi Zhang, Kaixin Sui
DOI: https://doi.org/10.1109/icws49710.2020.00026
2020-01-01
Abstract: Microservice systems are typically fragile, and failures in them are inevitable because of their complexity and large scale. However, localizing the root-cause metric is challenging because of the complicated dependencies within such systems and the huge number of diverse metrics. Existing methods are based either on correlation between metrics or on correlation between metrics and failures. All of them ignore a key data source in microservice systems: logs. In this paper, we propose a novel root-cause metric localization approach that incorporates log anomaly detection. Our approach is based on a key observation: the value of the root-cause metric should change along with the failure-induced change in the system's log anomaly score. Specifically, our approach includes two components: collecting anomaly scores with a log anomaly detection algorithm, and identifying the root-cause metric by robust correlation analysis with data augmentation. Experiments on an open-source benchmark microservice system demonstrate that our approach identifies root-cause metrics more accurately than existing methods and requires only a short localization time. Our approach can therefore help engineers save considerable effort in diagnosing and mitigating failures as quickly as possible.
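The key observation in the abstract can be illustrated with a minimal sketch: given a time series of log anomaly scores and several metric time series, rank the metrics by how strongly each one moves with the anomaly score. The abstract does not specify the correlation measure or the data augmentation step, so this sketch assumes Spearman rank correlation as a stand-in for "robust correlation analysis" and omits augmentation entirely. The function names, metric names, and toy data are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rank_metrics(anomaly_scores, metrics):
    """Sort metrics by absolute correlation with the log anomaly score."""
    scored = [
        (name, abs(spearman(anomaly_scores, series)))
        for name, series in metrics.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one anomaly score and one sample per metric per window.
anomaly_scores = [0.1, 0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1]
metrics = {
    "cpu_usage": [20, 22, 25, 80, 95, 75, 30, 21],  # rises with the anomaly
    "disk_free": [50, 50, 50, 50, 50, 50, 50, 50],  # constant, uninformative
    "net_in": [5, 9, 3, 7, 4, 8, 6, 5],             # unrelated fluctuation
}

ranking = rank_metrics(anomaly_scores, metrics)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the metric that tracks the anomaly score is ranked first as the root-cause candidate. A rank-based measure is used here only because it is less sensitive to outliers than plain Pearson correlation; the paper's actual analysis may differ.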