CHAOS: Accurate and Realtime Detection of Aging-Oriented Failure Using Entropy.

Pengfei Chen, Yong Qi, Di Hou
2015-01-01
Abstract: Even well-designed software systems suffer from chronic performance degradation, also known as "software aging", due to internal (e.g. software bugs) and external (e.g. resource exhaustion) impairments. These chronic problems often fly under the radar of software monitoring systems until they cause severe impacts (e.g. system failure). Detecting them in time to prevent a system crash is therefore a challenging issue. Although many approaches have been proposed to solve it, their accuracy and effectiveness are still far from satisfactory because the aging indicators they adopt are insufficient. In this paper, we present a novel entropy-based aging indicator, Multidimensional Multi-scale Entropy (MMSE). MMSE uses the complexity embedded in runtime performance metrics to indicate software aging and leverages multi-scale and multi-dimension integration to tolerate system fluctuations. Via theoretical proof and experimental evaluation, we demonstrate that MMSE satisfies Stability, Monotonicity and Integration, three properties that we conjecture an ideal aging indicator should have. Based upon MMSE, we develop three failure detection approaches encapsulated in a proof-of-concept named CHAOS. Experimental evaluations in a Video on Demand (VoD) system and in a real-world production system, AntVision, show that CHAOS can detect the failure-prone state with extraordinarily high accuracy and a near-zero Ahead-Time-To-Failure (ATTF). Compared to previous approaches, CHAOS improves the detection accuracy by about 5 times and reduces the ATTF by 3 orders of magnitude. In addition, CHAOS is lightweight enough to satisfy the realtime requirement.
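To make the multi-scale idea concrete, the following is a minimal sketch of the generic multi-scale sample entropy computation for a single performance metric: coarse-grain the series at several scales, then measure sample entropy at each scale. It is not the MMSE definition from the paper, which the abstract does not spell out, and it omits the multi-dimension integration entirely. The function names, the embedding length `m`, and the tolerance factor `r_factor` are illustrative assumptions.

```python
import math


def coarse_grain(series, scale):
    """Average consecutive, non-overlapping windows of length `scale`."""
    n = len(series) // scale
    return [sum(series[i * scale:(i + 1) * scale]) / scale for i in range(n)]


def sample_entropy(series, m=2, r=0.2):
    """Sample entropy with template length m and absolute tolerance r.

    Returns inf when no template matches are found, and nan when the
    series is too short to form at least two templates.
    """
    count = len(series) - m  # same number of templates for lengths m and m+1
    if count < 2:
        return float("nan")

    def matches(length):
        # Count template pairs (i < j) within Chebyshev distance r.
        total = 0
        for i in range(count):
            for j in range(i + 1, count):
                if all(abs(series[i + k] - series[j + k]) <= r
                       for k in range(length)):
                    total += 1
        return total

    b = matches(m)
    a = matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")
    return -math.log(a / b)


def multiscale_entropy(series, max_scale=3, m=2, r_factor=0.2):
    """Sample entropy of the coarse-grained series at scales 1..max_scale.

    The tolerance is fixed as r_factor times the standard deviation of the
    original series (an assumed, commonly used convention).
    """
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    r = r_factor * std
    return [sample_entropy(coarse_grain(series, s), m, r)
            for s in range(1, max_scale + 1)]
```

A hypothetical usage would call `multiscale_entropy(cpu_samples, max_scale=5)` on a window of a monitored metric and track how the resulting per-scale values drift over time. The nested loops are quadratic in the window length, so a realtime detector would likely need a faster matching scheme or short windows.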