GRANO: Interactive Graph-based Root Cause Analysis for Cloud-Native Distributed Data Platform

Sanjeev Katariya, Phuong T. Nguyen, Gene Zhang, Jun Li, Selçuk Köprü, Hanzhang Wang, S. Ben-Romdhane
DOI: https://doi.org/10.14778/3352063.3352105
IF: 2.5
2019-08-01
Proceedings of the VLDB Endowment
Abstract: We demonstrate Grano, an end-to-end anomaly detection and root cause analysis (RCA) system for cloud-native distributed data platforms that offers a holistic view of the system component topology, alarms, and application events. Grano comprises: a Detection Layer that processes large amounts of time-series monitoring data to detect anomalies at logical and physical system components; an Anomaly Graph Layer with novel graph modeling and algorithms that leverage system topology data and detection results to identify root cause relevance at the system component level; and an Application Layer that automatically notifies on-call personnel and presents real-time and on-demand RCA support through an interactive graph interface. The system is deployed and evaluated using eBay’s production data, helping on-call personnel shorten root cause identification from hours to minutes.
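
The abstract does not describe the graph model or the ranking algorithm, so the following is only an illustrative sketch of the general idea of combining topology data with detection results, not Grano's actual method. It assumes a hypothetical dependency map (`depends_on`), hypothetical per-component anomaly scores in [0, 1], and an invented scoring rule in which an anomalous component passes damped credit to the anomalous components it depends on. The function name, parameters, and component names are all assumptions made for the example.

```python
from collections import defaultdict


def rank_root_causes(depends_on, anomaly_scores, damping=0.5, max_depth=3):
    """Rank components by a hypothetical root-cause relevance score.

    depends_on:     dict mapping a component to the components it depends on.
    anomaly_scores: dict mapping a component to an anomaly score in [0, 1].
    damping:        factor applied to the credit at each additional hop.
    max_depth:      maximum number of dependency hops to follow.
    """
    relevance = defaultdict(float)
    for component, score in anomaly_scores.items():
        if score <= 0:
            continue
        # Every anomalous component starts with its own anomaly score.
        relevance[component] += score
        # Walk its dependencies breadth-first, one hop per iteration.
        frontier = [component]
        seen = {component}
        weight = 1.0
        for _ in range(max_depth):
            weight *= damping
            next_frontier = []
            for node in frontier:
                for dep in depends_on.get(node, []):
                    if dep in seen:
                        continue
                    seen.add(dep)
                    next_frontier.append(dep)
                    dep_score = anomaly_scores.get(dep, 0.0)
                    # Only dependencies that are themselves anomalous get credit.
                    if dep_score > 0:
                        relevance[dep] += weight * score * dep_score
            frontier = next_frontier
    # Highest relevance first.
    return sorted(relevance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical topology: an application depends on a database node and a
# cache; the database node depends on a disk.
topology = {"app": ["db_node", "cache"], "db_node": ["disk"]}
scores = {"app": 0.6, "db_node": 0.8, "disk": 0.9}  # the cache is healthy

ranking = rank_root_causes(topology, scores)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

In this toy case the disk ranks first because it is anomalous itself and also sits beneath two other anomalous components, which is the kind of component-level relevance signal the Anomaly Graph Layer is described as producing.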
Engineering, Computer Science