A Black-Box Approach for Detecting the Failure Traces

You Meng, Lang Yu, Zhongzhi Luan, Depei Qian, Ming Xie, Zhigao Du
DOI: https://doi.org/10.1007/978-3-662-43908-1_32
2014-01-01
Abstract: Detecting failure traces helps system administrators recover from failures promptly and avoid them in the future. For system managers, it is not difficult to detect whether a failure is currently occurring, because they are concerned with only a few key measurements: if these measurements exceed their normal thresholds, a failure event is generated. It is much more complicated to detect failure traces, which are represented as failure-related events, because these traces may last for a long time and affect many components. Furthermore, current distributed systems add and remove components so quickly that administrators may not have enough time or knowledge to set monitoring thresholds for each of them. To address these problems, we propose the FTD system. We first compare each component's historical states and take the outlier states as anomalous events. Then, combined with the failure events that the system provides, we detect the correlations between failure events and anomalous events as failure traces. The KDD99 network intrusion benchmark is used to evaluate our work, and we achieve good performance.
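
The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say which outlier test or correlation measure FTD uses, so everything below is an assumption made for illustration: a per-component z-score stands in for the outlier detection, a fixed time window before each failure stands in for the event correlation, and the names `anomalous_events`, `failure_traces`, `threshold`, and `window` are hypothetical.

```python
from statistics import mean, stdev


def anomalous_events(history, threshold=3.0):
    """Flag outlier states per component (assumed z-score test, not the paper's method).

    history maps a component name to a list of (timestamp, value) samples.
    Returns a sorted list of (timestamp, component) anomalous events.
    """
    events = []
    for component, samples in history.items():
        values = [value for _, value in samples]
        if len(values) < 2:
            continue  # not enough history to estimate spread
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            continue  # constant signal, nothing can be an outlier
        for timestamp, value in samples:
            if abs(value - mu) / sigma > threshold:
                events.append((timestamp, component))
    return sorted(events)


def failure_traces(failure_times, anomalies, window=60):
    """Attach to each failure the anomalies seen in the preceding window.

    A fixed look-back window is an assumed stand-in for the paper's
    event-correlation step.
    """
    return {
        failure: [(ts, comp) for ts, comp in anomalies
                  if failure - window <= ts <= failure]
        for failure in failure_times
    }


# Hypothetical data: "db" spikes at t=20, "web" stays flat.
history = {
    "db": [(t, 1.0) for t in range(20)] + [(20, 50.0)],
    "web": [(t, 2.0) for t in range(21)],
}
anomalies = anomalous_events(history)
traces = failure_traces([25, 200], anomalies, window=10)
```

Because each component is compared only against its own history, no per-component monitoring threshold has to be configured by hand, which is the black-box property the abstract emphasizes.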