Combination of consistent checkpointing and message logging: A novel CRR scheme for clusters of workstations

Dongsheng Wang, Weimin Zheng, Meinung Seen, Dingxing Wang
IF: 1.019
1997-01-01
Chinese Journal of Electronics
Abstract: Checkpointing and Rollback Recovery (CRR) is a well-known technique used in the design of fault-tolerant distributed and parallel computer systems. In this paper, we present a novel CRR technique for parallel workstation cluster systems that combines consistent checkpointing with message logging. The technique offers several advantages, including fast output commit, limited rollback, and simplified garbage collection. It also reduces implementation complexity. Moreover, for compute-intensive applications on workstation clusters, the combination achieves lower failure-free overhead than ordinary message logging techniques.