Abstract: Software fault-tolerance techniques require extensive process coordination when executed concurrently. Coordination is necessary to control the scope of process interaction or to assess the extent of error propagation. It can be done synchronously, as in the Conversation scheme, or asynchronously, as in the Chase protocol. Process coordination becomes complicated in a loosely coupled distributed system because there is no shared memory and it is difficult to obtain an up-to-date global system state. This thesis studies error recovery protocols that support distributed software fault tolerance, with emphasis on their synchronization and coordination aspects. We construct queueing models to study the synchronous approach under different execution environments. We study system throughput and utilization in the presence of communication delay and process failures, and investigate the scalability of this approach as the number of processes increases. Our results indicate that the synchronization requirement remains a performance bottleneck even when message transmission is instantaneous and the processes are fault-free. Implications of the analytical results for distributed transaction processing are also discussed. For the asynchronous approach, we examine the cause-effect relation between processes and study the components of a recovery subsystem that monitors dynamic process dependencies and directs process recovery. We compare and contrast various design alternatives for implementing these components, and propose efficient asynchronous recovery protocols based on the Receiver-initiated approach. The proposed protocols are evaluated via simulation studies. They are shown to have superior throughput while avoiding the cascaded abort problem that plagues previously proposed schemes.
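The claim that synchronization alone can limit performance can be illustrated with a toy calculation. The sketch below is not the queueing model used in the thesis. It assumes, purely for illustration, that n fault-free processes each do an independent, exponentially distributed amount of work with unit mean and then wait at a common synchronization point, with zero message delay. Under that assumption the expected round length is the n-th harmonic number, so per-process utilization is 1/H_n. The function names are hypothetical.

```python
from fractions import Fraction


def expected_round_time(n: int, mean_work: float = 1.0) -> float:
    """Expected time until all n processes reach the synchronization point.

    Assumption (illustrative only): work times are i.i.d. exponential with
    the given mean, so E[max] = mean_work * H_n, where H_n = 1 + 1/2 + ... + 1/n.
    """
    h_n = sum(Fraction(1, k) for k in range(1, n + 1))
    return mean_work * float(h_n)


def utilization(n: int) -> float:
    """Long-run fraction of time a process spends on useful work.

    Each round contributes one unit of mean work per process and lasts
    H_n units on average, so utilization is 1 / H_n.
    """
    return 1.0 / expected_round_time(n)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"n={n:2d}  utilization={utilization(n):.3f}")
```

Even in this idealized setting, utilization falls as processes are added, because every round is paced by the slowest participant.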

Rollback Recovery in Distributed Systems Using Loosely Synchronized Clocks

Fault Tolerant Real-Time Scheduling Strategy for NC System Based on Rollback Recovery

A Multi-Cycle Checkpointing Protocol That Ensures Strict 1-Rollback

A time-based rollback recovery algorithm in mobile computing systems

Asynchronous recovery protocols for distributed systems

Precision clock synchronization method over network redundancy system

Rollback Algorithm and Crash Recovery Based on Fault-Sensitive Graphs

Hardware rollback recovery schemes for multiprocessor systems

ROLLBACK-RECOVERY ALGORITHM BASED ON THE CHECKPOINT DEPENDENCY GRAPH AND THE PROPERTY TABLE

Process Synchronization and Coordination in Error Recovery Protocols for Distributed Computing Systems

An Asynchronous Scheme for Rollback Recovery in Message-Passing Concurrent Programming Languages

Checkpointing and Rollback Recovery for Network of Workstations

Selective synchronization in multi-cycle checkpointing

CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing.

Combination of consistent checkpointing and message logging: A novel CRR scheme for clusters of workstations

Dependency-Aware Rollback and Checkpoint-Restart for Distributed Task-Based Runtimes

Checkpointing for Workflow Recovery

Log based rollback recovery algorithm in mobile environment

An efficient forward recovery checkpointing scheme in dissimilar redundancy computer system

Reliable Assurance Model for Distributed System Survivability

Dynamic Cluster Strategy for Hierarchical Rollback‐recovery Protocols in MPI HPC Applications