PinSQL: Pinpoint Root Cause SQLs to Resolve Performance Issues in Cloud Databases

Xiaoze Liu, Zheng Yin, Chao Zhao, Congcong Ge, Lu Chen, Yunjun Gao, Dimeng Li, Ziting Wang, Gaozhong Liang, Jian Tan, Feifei Li
DOI: https://doi.org/10.1109/icde53745.2022.00236
2022-01-01
Abstract: Deploying database services on cloud systems has become common practice in industry. However, complicated cloud environments make performance issues inevitable, and these issues can violate service-level guarantees if not addressed in a timely manner. Among the various problems, anomalies in SQL queries are the most commonly reported cause of performance issues in database applications. These anomalous queries can be divided into High-impact SQLs (H-SQLs), the SQLs that are correlated with the anomalies, and Root Cause SQLs (R-SQLs), the SQLs that are the root causes of the performance issue. In the presence of a large number of queries, pinpointing the R-SQLs is far more difficult than identifying the H-SQLs. To address this challenge, we aim to automatically pinpoint the R-SQLs to resolve performance issues in cloud databases. This paper introduces PinSQL, an autonomous diagnosing system for Alibaba Cloud with four modules that are executed sequentially: data collection and pre-processing, anomaly detection, root cause analysis, and repairing actions. First, the related performance metrics and query logs from monitored cloud database instances are collected and aggregated as the data sources. Then, based on these inputs, efficient anomaly detection is conducted in real time. Upon detection of an anomaly, the root cause SQLs are pinpointed by tracking the propagation chain of the involved SQLs. Finally, repairing actions are suggested and then executed on the R-SQLs to address the anomalies. Extensive experiments on an Alibaba production system show that PinSQL achieves 80% accuracy in pinpointing the top-1 R-SQLs and, as a result, successfully resolves the database performance issues.