Log-based Abnormal Task Detection and Root Cause Analysis for Spark
Siyang Le, BingBing Rao, Xiang Wei, Byungchul Tak, Long Wang, Liqiang Wang
DOI: https://doi.org/10.1109/icws.2017.135
2017-01-01
Abstract: Application delays caused by abnormal tasks are a common problem in big data computing frameworks. An abnormal task in Spark, which may run slowly without producing error or warning logs, not only reduces the performance of its resident node but also affects the efficiency of other nodes. Spark log files report neither the root causes of abnormal tasks nor where and when abnormal scenarios happen. Although Spark provides a "speculation" mechanism to detect straggler tasks, it can only detect stragglers at the tail of each stage. Because the root causes of abnormalities are complicated, there are no effective ways to identify them. This paper proposes an approach that detects abnormalities and analyzes their root causes using Spark log files. Unlike common online monitoring or analysis tools, our approach is a purely offline method that can analyze abnormalities accurately. Our approach consists of four steps. First, a parser preprocesses raw log files to generate structured log data. Second, in each stage of a Spark application, we choose features related to the execution time and data locality of each task, as well as the memory usage and garbage collection of each node. Third, based on the selected features, we detect where and when abnormalities happen. Finally, we analyze the problems using weighted factors to determine the probability of each root cause. In this paper, we consider four potential root causes of abnormalities: CPU, memory, network, and disk. The proposed method has been tested on real-world Spark benchmarks. To simulate various root-cause scenarios, we conducted interference injections related to CPU, memory, network, and disk. Our experimental results show that the proposed approach is accurate both in detecting abnormal tasks and in finding their root causes.
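The third and fourth steps of the approach (per-stage detection, then weighted root-cause scoring) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `TaskRecord` fields, the `mean + k * std` threshold rule, the hypothetical helper names `detect_abnormal_tasks` and `rank_root_causes`, and the factor and weight values. The abstract does not specify the actual features, thresholds, or weights used.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class TaskRecord:
    """One structured log entry per task (hypothetical schema)."""
    stage_id: int
    task_id: int
    node: str
    duration_ms: float
    locality: str       # e.g. "PROCESS_LOCAL", "NODE_LOCAL", "ANY"
    gc_time_ms: float


def detect_abnormal_tasks(tasks, k=1.5):
    """Flag tasks whose duration exceeds mean + k * std within their stage.

    The threshold rule is an assumed stand-in for the paper's detector.
    """
    by_stage = defaultdict(list)
    for t in tasks:
        by_stage[t.stage_id].append(t)

    abnormal = []
    for stage_tasks in by_stage.values():
        durations = [t.duration_ms for t in stage_tasks]
        threshold = mean(durations) + k * pstdev(durations)
        abnormal.extend(t for t in stage_tasks if t.duration_ms > threshold)
    return abnormal


def rank_root_causes(factors, weights):
    """Turn weighted factor scores into a normalized, probability-like ranking.

    `factors` and `weights` map a cause name (cpu, memory, network, disk)
    to a non-negative number; both are hypothetical inputs.
    """
    scores = {cause: weights[cause] * factors[cause] for cause in weights}
    total = sum(scores.values())
    if total == 0:
        return {cause: 0.0 for cause in scores}
    return {cause: score / total for cause, score in scores.items()}


# Usage: five tasks in one stage, one of them much slower than the rest.
tasks = [
    TaskRecord(0, 0, "node-1", 100.0, "PROCESS_LOCAL", 5.0),
    TaskRecord(0, 1, "node-1", 110.0, "PROCESS_LOCAL", 6.0),
    TaskRecord(0, 2, "node-2", 105.0, "NODE_LOCAL", 5.0),
    TaskRecord(0, 3, "node-2", 95.0, "PROCESS_LOCAL", 4.0),
    TaskRecord(0, 4, "node-3", 400.0, "ANY", 60.0),
]
abnormal = detect_abnormal_tasks(tasks)
ranking = rank_root_causes(
    factors={"cpu": 0.8, "memory": 0.1, "network": 0.05, "disk": 0.05},
    weights={"cpu": 1.0, "memory": 1.0, "network": 1.0, "disk": 1.0},
)
```

Grouping by stage before computing the threshold reflects the abstract's point that features are chosen "in each stage"; a single global threshold would mix stages with very different typical task durations.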