Syslog Processing For Switch Failure Diagnosis And Prediction In Datacenter Networks

Shenglin Zhang,Weibin Meng,Jiahao Bu,Sen Yang,Ying Liu,Dan Pei,Jun Xu,Yu Chen,Hui Dong,Xianping Qu,Lei Song
DOI: https://doi.org/10.1109/IWQoS.2017.7969130
2017-01-01
Abstract:Syslogs on switches are a rich source of information for both post-mortem diagnosis and proactive prediction of switch failures in a datacenter network. However, such information can be effectively extracted only through proper processing of syslogs, e.g., using suitable machine learning techniques. A common approach to syslog processing is to extract (i.e., build) templates from historical syslog messages and then match syslog messages to these templates. However, existing template extraction techniques either have low accuracies in learning the "correct" set of templates, or does not support incremental learning in the sense the entire set of templates has to be rebuilt (from processing all historical syslog messages again) when a new template is to be added, which is prohibitively expensive computationally if used for a large datacenter network. To address these two problems, we propose a frequent template tree (FT-tree) model in which frequent combinations of (syslog) words are identified and then used as message templates. FT-tree empirically extracts message templates more accurately than existing approaches, and naturally supports incremental learning. To compare the performance of FT-tree and three other template learning techniques, we experimented them on two-years' worth of failure tickets and syslogs collected from switches deployed across 10+ datacenters of a tier-1 cloud service provider. The experiments demonstrated that FT-tree improved the estimation/prediction accuracy (as measured by F1) by 155% to 188%, and the computational efficiency by 117 to 730 times.
What problem does this paper attempt to address?