Checkpointing and Rollback Recovery for Network of Workstations

Dongsheng Wang, Weimin Zheng, Dingxing Wang, Meiming Shen
DOI: https://doi.org/10.1007/bf02917117
1999-01-01
Science China Technological Sciences
Abstract: Networks of workstations (NOW) have become one of the main trends in parallel computing. Long-running scientific programs on a NOW, however, need effective fault tolerance because of the platform's changing nature. Checkpointing and rollback recovery is a solution to this problem. The main problems of rollback recovery are discussed first and the different checkpointing techniques for NOW are analyzed; the design and implementation of the ChaRM (checkpoint-based rollback recovery and process migration) system are then described. A comparison of three coordinated checkpointing systems is also given.