Hunting Killer Tasks for Cloud System through Behavior Pattern Learning

Hongyan Tang, Ying Li, Tong Jia, Zhonghai Wu
DOI: https://doi.org/10.1109/DSN-W.2016.31
2016-01-01
Abstract: Motivated by frequent failures in cloud computing systems, we analyze the failure frequency and continuity of tasks from the Google cloud cluster and identify what we call killer tasks: tasks that suffer from frequent failures and repeated rescheduling. Killer tasks can be a major concern in cloud systems because they waste resources unnecessarily and significantly increase the scheduling workload. In this paper, we investigate the characteristics and behavior patterns of killer tasks, then develop an approach to recognize them at a very early stage of their occurrence so that they can be addressed proactively instead of being rescheduled repeatedly. The empirical results show that our approach achieves 97% precision in recognizing killer tasks, with a maximum lead time of 1,164 minutes and 89% resource savings for the cloud system on average.