SCR algorithm

Wei Xiao-Hui, Ju Jiu-Bin
DOI: https://doi.org/10.1145/309829.309839
1999-01-01
ACM SIGOPS Operating Systems Review
Abstract: Fault tolerance is very important in cluster computing. Many well-known cluster-computing systems implement fault tolerance with a checkpoint/restart mechanism. However, existing checkpointing algorithms cannot restore the state of a file system when a program is rolled back, so existing fault-tolerant systems place many restrictions on file access. This paper presents the SCR algorithm, which is based on atomic operations and consistent scheduling and can restore the state of a file system. In the SCR algorithm, system calls on file systems are classified as idempotent or non-idempotent operations: a non-idempotent operation modifies the file system's state, and an idempotent operation does not. The SCR algorithm dynamically tracks a running program and logs to disk each non-idempotent operation the program performs, together with the information needed to restore the state that preceded it. When the program is rolled back to a checkpoint, the SCR algorithm reverts the file system to its state at the time of the last checkpoint. With the SCR algorithm, users may use any file operation in their programs.
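The logging-and-revert idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `UndoLog` class, its method names, the JSON-lines log format, and the whole-file "before image" strategy are all assumptions made for illustration, and the sketch ignores atomicity and scheduling concerns that the SCR algorithm addresses.

```python
import base64
import json
import os
import tempfile


class UndoLog:
    """Hypothetical undo log for file operations (illustration only)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.checkpoint()

    def checkpoint(self):
        # After a checkpoint, earlier undo records are no longer needed,
        # so the log is truncated.
        with open(self.log_path, "w"):
            pass

    def _record(self, path):
        # Save the file's prior contents (or its absence) on disk
        # before the modifying operation runs.
        before = None
        if os.path.exists(path):
            with open(path, "rb") as f:
                before = base64.b64encode(f.read()).decode("ascii")
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"path": path, "before": before}) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def write(self, path, data):
        # Modifies the file system: logged ("non-idempotent" in the
        # abstract's terminology).
        self._record(path)
        with open(path, "wb") as f:
            f.write(data)

    def remove(self, path):
        # Modifies the file system: logged.
        self._record(path)
        os.remove(path)

    def read(self, path):
        # Does not modify the file system: nothing is logged.
        with open(path, "rb") as f:
            return f.read()

    def rollback(self):
        # Undo logged operations in reverse order to restore the state
        # at the last checkpoint.
        with open(self.log_path) as log:
            records = [json.loads(line) for line in log]
        for rec in reversed(records):
            if rec["before"] is None:
                if os.path.exists(rec["path"]):
                    os.remove(rec["path"])
            else:
                with open(rec["path"], "wb") as f:
                    f.write(base64.b64decode(rec["before"]))
        self.checkpoint()


def demo():
    with tempfile.TemporaryDirectory() as d:
        log = UndoLog(os.path.join(d, "undo.log"))
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        log.write(a, b"v1")
        log.checkpoint()          # state to return to: a.txt == b"v1"
        log.write(a, b"v2")       # overwrite after the checkpoint
        log.write(b, b"new")      # create a new file after the checkpoint
        log.rollback()
        return log.read(a), os.path.exists(b)
```

Calling `demo()` returns `(b"v1", False)`: the overwrite of `a.txt` is undone and the file created after the checkpoint is removed. Storing a full copy of each file before modification keeps the sketch short; a practical system would more likely log only the affected byte ranges or metadata.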