CauseInfer: Automated End-to-End Performance Diagnosis with Hierarchical Causality Graph in Cloud Environment

Pengfei Chen,Yong Qi,Di Hou
DOI: https://doi.org/10.1109/tsc.2016.2607739
IF: 11.019
2019-01-01
IEEE Transactions on Services Computing
Abstract:Modern computing systems especially cloud-based and cloud-centric systems always consist of a mass of components running in large distributed environments with complicated interactions. They are vulnerable to performance problems due to the highly dynamic runtime environment changes (e.g., overload and resource contention) or software bugs (e.g., memory leak). Unfortunately, it is notoriously difficult to diagnose the root causes of these performance problems in a fine granularity due to complicated interactions and a large cardinality of potential cause set. In this paper, we build an automated, black-box and end-to-end cause inference system named CauseInfer to pinpoint the root causes or at least provide some hints. CauseInfer can automatically map a distributed system to a two-layer hierarchical causality graph and infer the root causes along the causal paths in the causality graph. CauseInfer models the fault propagation paths in an explicit way and works without instrumentation to the running production system, which makes CauseInfer more effective and practical than previous approaches. The experimental evaluations in two benchmark systems show that CauseInfer can identify the root causes in a high accuracy. Compared to several state-of-the-art approaches, CauseInfer can achieve over 10 percent improvement. Moreover, CauseInfer is lightweight and flexible enough to readily scale out in large distributed systems. With CauseInfer, the mean time to recovery (MTTR) of the cloud systems can be significantly reduced.
What problem does this paper attempt to address?