Abstract: Software reliability in distributed systems has always been a major concern for all stakeholders, especially application vendors and their users. Various models have been produced to assess or predict the reliability of large-scale distributed applications, including e-government, e-commerce, multimedia services, and end-to-end automotive solutions, but reliability issues with these systems still exist. Ensuring a distributed system's reliability in turn requires examining the reliability of each individual component or factor involved in enterprise distributed applications before predicting or assessing the reliability of the whole system, and implementing transparent fault detection and fault recovery schemes to provide seamless interaction to end users. For this reason, we analyze in detail existing reliability methodologies from the viewpoint of examining the reliability of individual components, and explain why a comprehensive reliability model is still needed for applications running in distributed systems. In this paper we present a detailed technical overview of research done in recent years on analyzing and predicting the reliability of large-scale distributed applications, in four parts. First, we describe some pragmatic requirements for highly reliable systems and highlight the significance and various issues of reliability in different computing environments such as Cloud Computing, Grid Computing, and Service Oriented Architecture. Then we elucidate certain factors and challenges that are nontrivial for highly reliable distributed systems, including fault detection, recovery, and removal through testing or various replication techniques. Next, we scrutinize various research models that synthesize significant solutions to tackle these factors and challenges in predicting as well as measuring the reliability of software applications in distributed systems. Finally, we discuss the limitations of existing models and propose future work for predicting and analyzing the reliability of distributed applications in real environments in light of our analysis.

SIMULATING THE RELIABILITY OF DISTRIBUTED SYSTEMS WITH UNRELIABLE NODES

An Efficient Algorithm for Reliability Lower Bound of Distributed Systems

Node Reliability: Approximation, Upper Bounds, and Applications to Network Robustness

Reliability Analysis of Distributed Storage Systems Considering Data Loss and Theft

Reliability Model of Distributed Simulation System

Reliability Assessment of Stochastic Networks with ER Connectivity and ER Dependency

Reliability Quantification of the Tree Structure Based Distributed System

Probabilistic Analysis on Connectivity for Sensor Grids with Unreliable Nodes

The Reliability of a Class of Two-Layer Networks with Unreliable Edges

Graph Theory-based Distribution System Reliability Evaluation

Reliability Assessment in Distributed Multi-State Series-Parallel Systems

Exact two-terminal reliability of some directed networks

A survey on reliability in distributed systems

Reliability assessment of complex electromechanical systems: A network perspective

Probabilistic Analysis on Mesh Network Fault Tolerance

Analytical probability propagation method for reliability analysis of general complex networks

Robustness on distributed coupling networks with multiple dependent links from finite functional components

A Study of Service Reliability and Availability for Distributed Systems

Recursive Method for Distribution System Reliability Evaluation

Estimation of dependability measures of Large-scale Markov-dependent consecutive-k-out-of-n: F repairable systems

Principled network reliability approximation: A counting-based approach