Time Series Based Killer Task Online Recognition Service: A Google Cluster Case Study

Tang Hongyan, Li Ying, Jia Tong, Yuan Xiaoyong, Wu Zhonghai
DOI: https://doi.org/10.1109/sose.2016.23
2016-01-01
Abstract: To better understand task failures in cloud computing systems, we analyze the failure frequency of tasks in the Google cluster dataset and identify what we call killer tasks: tasks that suffer from long-term failures and repeated rescheduling. Killer tasks can be a major concern for cloud systems because they waste resources unnecessarily and significantly increase the scheduling workload. Hence, cloud system operators need a service that recognizes killer tasks in time. In this paper, we propose an online killer task recognition service based on resource usage time series. It can recognize killer tasks at a very early stage of their occurrence, so that they can be handled appropriately instead of being rescheduled. The experimental results show that the proposed service achieves 93.6% accuracy in recognizing killer tasks, with an 87% timing advance and 86.6% resource savings for the cloud system on average.