Selective synchronization in multi-cycle checkpointing

YiWei Ci, Zhan Zhang, Decheng Zuo, Zhibo Wu, Xiaozong Yang
2009-01-01
Abstract: In distributed computing systems, checkpointing techniques are often used to provide fault tolerance. To improve the autonomy of checkpointing and to control the computation loss of processes, communication-induced checkpointing protocols have been proposed. However, in these protocols, individual processes cannot determine the rollback distance. For this reason, this paper proposes a multi-cycle checkpointing protocol. It allows processes to take checkpoints with different checkpoint cycles and makes it possible to select a specific checkpoint for recovery.