A general fault-tolerance framework for grid computing

Xuanhua Shi, Hai Jin, Weizhong Qiang
DOI: https://doi.org/10.3321/j.issn:1671-4512.2006.07.014
2006-01-01
Abstract: To meet the requirements of reliable grid computing, a general fault-tolerance framework is proposed that combines a hierarchical fault detection service with a policy-based fault-handling method. At the bottom of the fault detection service, a local fault detector monitors the objects in its local area and sends heartbeat messages to a middle data collector; the middle data collector sends the status list of the monitored objects to the top data collector at a specified interval; the top data collector is managed by an index server. When a fault is detected, the system chooses an appropriate fault-handling method, such as checkpointing, retrying, or replication. The performance evaluation shows that the framework is scalable, efficient, and low-overhead.
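The detection hierarchy and policy lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the timeout rule for marking objects as "suspected", the state strings, and the mapping from task class to handling method are all assumptions made for illustration. Time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MiddleDataCollector:
    """Middle layer: receives heartbeats from local fault detectors."""
    timeout: float  # assumed: seconds of silence before a detector is distrusted
    reports: Dict[str, Tuple[float, Dict[str, str]]] = field(default_factory=dict)

    def receive_heartbeat(self, detector_id: str, states: Dict[str, str],
                          now: float) -> None:
        # Each heartbeat carries the states the local detector observed.
        self.reports[detector_id] = (now, dict(states))

    def status_list(self, now: float) -> Dict[str, str]:
        # Objects behind a silent detector are marked "suspected" (assumed rule).
        merged: Dict[str, str] = {}
        for sent_at, states in self.reports.values():
            silent = now - sent_at > self.timeout
            for obj_id, state in states.items():
                merged[obj_id] = "suspected" if silent else state
        return merged


@dataclass
class TopDataCollector:
    """Top layer: aggregates status lists sent by middle collectors."""
    status: Dict[str, str] = field(default_factory=dict)

    def receive_status_list(self, status_list: Dict[str, str]) -> None:
        self.status.update(status_list)

    def faulty(self) -> List[str]:
        return sorted(o for o, s in self.status.items() if s != "ok")


# Hypothetical policy table: task class -> fault-handling method.
POLICY = {"short": "retry", "long": "checkpoint", "critical": "replicate"}


def choose_handler(task_class: str, policy: Dict[str, str] = POLICY) -> str:
    # Fall back to retrying when no policy entry matches (assumed default).
    return policy.get(task_class, "retry")


if __name__ == "__main__":
    middle = MiddleDataCollector(timeout=5.0)
    top = TopDataCollector()
    middle.receive_heartbeat("detector-a", {"job-1": "ok", "job-2": "failed"}, now=0.0)
    middle.receive_heartbeat("detector-b", {"job-3": "ok"}, now=0.0)
    middle.receive_heartbeat("detector-a", {"job-1": "ok", "job-2": "failed"}, now=8.0)
    top.receive_status_list(middle.status_list(now=10.0))
    for obj_id in top.faulty():
        print(obj_id, top.status[obj_id], choose_handler("long"))
```

In this sketch, an object is reported as faulty either because its local detector observed a failure or because the detector itself stopped sending heartbeats. A real deployment would also need the index server that manages the top data collectors, which is omitted here.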