Generic and Robust Performance Diagnosis Via Causal Inference for OLTP Database Systems
Xianglin Lu, Zhe Xie, Zeyan Li, Mingjie Li, Xiaohui Nie, Nengwen Zhao, Qingyang Yu, Shenglin Zhan, Kaixin Sui, Lin Zhu, Dan Pei
DOI: https://doi.org/10.1109/ccgrid54584.2022.00075
2022-01-01
Abstract: Online transaction processing (OLTP) database systems provide effective data support for online applications that require high concurrency and low latency. An interruption or performance degradation of an OLTP database system may impact the availability of services and cause substantial economic loss. Thus, diagnosing issues promptly and mitigating them rapidly are essential for database administrators (DBAs). However, performance diagnosis for database systems is challenging due to numerous abnormal metrics, complex failure propagation, and high performance requirements. Existing works that rely on anomaly detection or causal graph construction cannot handle all these challenges simultaneously. In this paper, we propose an unsupervised learning-based method, CauseRank, to perform root cause localization with superior efficiency, high accuracy, and good interpretability. CauseRank relies on two key techniques. The first is a novel causal discovery algorithm named Group-based Greedy Equivalent Search (G-GES), which incorporates domain knowledge and treats metric groups as nodes to capture failure propagation. The second is a simple yet effective ranking method named Causal Oriented Personalized PageRank (COPP). Extensive experiments on 97 real-world failure cases collected from a large-scale Oracle database demonstrate the effectiveness of CauseRank, which achieves 82.5% top-3 accuracy and 93.8% top-5 accuracy and outperforms baseline approaches. The core idea and framework of CauseRank are generic and can be applied to other large-scale system components.
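To make the ranking idea concrete, the following is a minimal sketch of standard personalized PageRank run over a small causal graph whose nodes are metric groups. It is not the paper's COPP or G-GES: the abstract does not specify either algorithm, so the graph, the group names, the anomaly scores, the damping factor, and the choice to walk edges from effect back to cause are all assumptions made for illustration.

```python
import numpy as np


def personalized_pagerank(adj, personalization, damping=0.85,
                          tol=1e-10, max_iter=1000):
    """Plain personalized PageRank by power iteration.

    adj[i][j] is the weight of the edge i -> j that the walk follows.
    """
    adj = np.asarray(adj, dtype=float)
    p = np.asarray(personalization, dtype=float)
    p = p / p.sum()  # restart distribution

    out = adj.sum(axis=1)
    has_out = out[:, None] > 0
    # Row-normalize the transition matrix; a node with no outgoing
    # edge (dangling) jumps according to the restart distribution.
    trans = np.where(has_out, adj / np.where(has_out, out[:, None], 1.0),
                     p[None, :])

    rank = p.copy()
    for _ in range(max_iter):
        new_rank = (1 - damping) * p + damping * (rank @ trans)
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical metric groups and causal edges (cause -> effect).
groups = ["cpu", "io", "sessions", "latency"]
causal = np.zeros((4, 4))
causal[1, 2] = 1.0  # io -> sessions
causal[2, 3] = 1.0  # sessions -> latency
causal[0, 3] = 1.0  # cpu -> latency

# Assumption: walk the reversed graph so that score flows from
# observed effects back toward candidate causes.
walk_graph = causal.T

# Hypothetical per-group anomaly scores used as the restart vector.
anomaly = [0.1, 0.6, 0.2, 0.1]

scores = personalized_pagerank(walk_graph, anomaly)
ranking = [groups[i] for i in np.argsort(-scores)]
print(ranking)
```

The restart vector is what makes the walk "personalized": groups with stronger anomalies receive more restart mass, while the graph structure redistributes score along the assumed propagation paths. How CauseRank actually weights edges and restarts is described in the full paper, not in this abstract.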