Abstract: Many modern applications demand a high degree of software reliability, service availability, and guaranteed timeliness for critical task executions. Because such applications have grown rapidly in variety and functional complexity, the existing methods and tools for developing sizable fault-tolerant (FT) real-time (RT) distributed computing (DC) applications have become insufficient. The ROAFTS (Real-time Object-oriented Adaptive Fault Tolerance Support) middleware model has been evolving in the UCI DREAM Laboratory over the past decade as a reliable execution engine model for FT RT DC applications. ROAFTS integrates various mechanisms for fault detection and recovery in a form that meshes with high-level RT DC component-based programming schemes, in particular the TMO (Time-triggered Message-triggered Object) programming scheme. ROAFTS is the first component-based support middleware model for FT RT DC. The previous ROAFTS model, however, has weaknesses in several important areas. Some major drawbacks are: (a) the incorporated network surveillance scheme does not consider network partitioning failures and network merging; (b) message transmission failures are not handled with sufficient rigor; and (c) the mechanism for collecting and processing failure reports from different parts of the system is weak and incomplete, in that its completeness and correctness were not rigorously established. The work reported in this dissertation establishes a complete and robust middleware model by taking the previous ROAFTS as a starting point and enhancing and integrating multi-level fault detection and recovery mechanisms. The resulting middleware model is called ROAFTS II.
This dissertation presents: (a) an improved RT fault detection scheme, SNS (Supervisor-based Network Surveillance) II, which is capable of detecting network partition events and network merges and of locating fault sources; (b) a reliable messaging protocol, called RMP, for fast detection and masking of message losses due to transient faults occurring on the communication paths; (c) a template-based implementation technique enabling TMO application developers to easily implement primary-shadow TMO replicas; (d) a new mechanism for fault report handling and system reconfiguration; and (e) an extension software layer for managing home network devices. Each of these contributions facilitates the analysis of fault detection latency bounds. An experimental evaluation has been conducted. Based on its results, the presented middleware model represents an important step toward establishing a solid foundation for cost-effective development of FT RT DC applications.

Fault-tolerance in a distributed management system: a case study

Research on Fault Tolerance in Hybrid P2P-based Collaborative Systems

Fundamentals of fault-tolerant distributed computing in asynchronous environments

Method and system for byzantine fault-tolerance replicating of data

Distributed Model Based on Data Partition and Load Balance Algorithm

The Research and Implementation of a CORBA-Based Architecture for Adaptive Fault Tolerance in Distributed Systems

Fault Tolerance in Distributed Systems: A Survey

Fault-tolerant Mechanism of the Distributed Cluster Computers

Distributed Fault-Tolerant Avionic Systems - A Real-Time Perspective

Fault-Tolerant Partial Replication in Large-Scale Database Systems

Fault Tolerance in Distributed Systems using Fused State Machines

A Fault Tolerant Object Management Framework Based On Middleware For Dynamic Reconfiguration

Design of COTS-based Fault-Tolerant Multiprocessor Real-Time Operating System

A Light-Weight Distributed System for the processing of Replicated Counter-like Objects

On the Performance Potential of Connection Fault-Tolerant Commit Processing in Mobile Environment

Self-healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance

Design And Performance Analysis Of Distributed Fault Tolerant Storage Systems

Robust Integration of Multi-Level Fault Detection Mechanisms and Recovery Mechanisms in a Component-Based Support Middleware Model for Fault-Tolerant Real-Time Distributed Computing

Fault Tolerance in Real-Time Systems: A Review

Byzantine Fault Tolerance in MDS of Grid System

Safety Evaluation of Critical Applications Distributed on TDMA-Based Networks