Evaluating Performance Of Rescheduling Strategies In Cloud System

Tang Hongyan, Li Ying, Jia Tong, Wu Zhonghai
DOI: https://doi.org/10.1109/TrustCom.2016.0242
2016-01-01
Abstract: Motivated by frequent failures in cloud computing systems, we aim to demystify the rescheduling strategies that schedulers use to deal with task failures. In this paper, we comprehensively investigate and compare the rescheduling strategies of two large-scale systems, the Google cluster and CMU OpenCloud. Moreover, we quantitatively evaluate the performance of the different rescheduling strategies and uncover their negative impacts. Based on our analysis, we find that repeatedly rescheduling a task immediately after every failure, without any limit, can improve task availability on the one hand, but on the other hand incurs significant overhead and wastes a large amount of resources. Furthermore, migrating failed tasks to different machines can be an effective way to eventually achieve successful execution, although at the cost of lower availability. Our analysis provides valuable guidance for designing schedulers that achieve a better trade-off between availability and resource savings in cloud systems.