Abstract: Deploying database services on cloud systems has become increasingly popular and is now common practice in industry. However, the complexity of cloud environments makes performance issues inevitable, and these issues can violate service-level guarantees if they are not addressed in a timely manner. Among these problems, anomalous SQL queries are the most commonly reported cause of performance issues in database applications. These anomalous queries can be divided into High-impact SQLs (H-SQLs), which are correlated with the anomalies, and Root Cause SQLs (R-SQLs), which are the root causes of the performance issue. When the number of queries is large, pinpointing the R-SQLs is far more difficult than identifying the H-SQLs. To address this challenge, we aim to pinpoint the R-SQLs automatically in order to resolve performance issues in cloud databases. This paper introduces PinSQL, an autonomous diagnosis system for Alibaba Cloud consisting of four modules that are executed sequentially: data collection and pre-processing, anomaly detection, root cause analysis, and repairing actions. First, the relevant performance metrics and query logs from monitored cloud database instances are collected and aggregated as data sources. Then, based on these inputs, efficient anomaly detection is conducted in real time. When an anomaly is detected, the root cause SQLs are pinpointed by tracking the propagation chain of the involved SQLs. Finally, repairing actions are suggested and then executed on the R-SQLs to address the anomalies. Extensive experiments on an Alibaba production system show that PinSQL achieves 80% accuracy in pinpointing the top-1 R-SQLs and, as a result, successfully resolves the database performance issues.

PinSQL: Pinpoint Root Cause SQLs to Resolve Performance Issues in Cloud Databases

Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases

Schema-Driven Performance Evaluation for Highly Concurrent Scenarios

FlowPinpoint: Localizing Anomalies in Cloud-client Services for Cloud Providers

Distance Based Root Cause Analysis and Change Impact Analysis of Performance Regressions

CloudPin: A Root Cause Localization Framework of Shared Bandwidth Package Traffic Anomalies in Public Cloud Networks

PolarDB-IMCI: A Cloud-Native HTAP Database System at Alibaba

CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

Localizing Root Causes of Performance Anomalies in Cloud Computing Systems by Analyzing Request Trace Logs

Generic and Robust Performance Diagnosis Via Causal Inference for OLTP Database Systems

A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications

A Real-Time Trace-Level Root-Cause Diagnosis System in Alibaba Datacenters

Hierarchical Diagnostic Approach for Performance Problems in Cloud Computing Platforms

Root Cause Analysis of Anomalies of Multitier Services in Public Clouds

A Comparative Study of In-Database Inference Approaches

FluxInfer: Automatic Diagnosis of Performance Anomaly for Online Database System

Real-time Workload Pattern Analysis for Large-scale Cloud Databases

Magnifier: Online Detection of Performance Problems in Large-Scale Cloud Computing Systems

TencentCLS: The Cloud Log Service with High Query Performances

Sage: Leveraging ML to Diagnose Unpredictable Performance in Cloud Microservices

DBPA: A Benchmark for Transactional Database Performance Anomalies