Analysis of Frequently Failing Tasks and Rescheduling Strategy in the Cloud System

Hongyan Tang, Ying Li, Tong Jia, Xiaoyong Yuan, Zhonghai Wu
DOI: https://doi.org/10.4018/IJDST.2018010102
2018-01-01
Abstract: To better understand task failures in cloud computing systems, the authors analyze the failure frequency of tasks in the Google cluster dataset and find that some frequently failing tasks suffer from long-term failures and repeated rescheduling. They call these killer tasks because they can be a major concern for cloud systems. Hence there is a need to analyze killer tasks thoroughly and recognize them precisely. In this article, the authors first investigate the resource usage patterns of killer tasks and analyze how the Google cluster reschedules them, finding that repeated rescheduling wastes a large amount of resources. Based on these observations, they then propose an online killer task recognition service that recognizes killer tasks at a very early stage of their occurrence so as to avoid unnecessary resource waste. The experimental results show that, on average, the proposed service achieves 93.6% accuracy in recognizing killer tasks, with an 87% timing advance and 86.6% resource savings for the cloud system.