An in-depth and insightful exploration of failure detection in distributed systems

Bhavana Chaurasia, Anshul Verma, Pradeepika Verma
DOI: https://doi.org/10.1016/j.comnet.2024.110432
IF: 5.493
2024-04-23
Computer Networks
Abstract: In today's world, everyone wants a good profit with a tiny investment and distributed computing is a boon for this purpose. Cloud computing, fog computing, and the Internet of Things (IoT) are well-known examples of distributed computing which provide good computing services and performance. However, providing reliable services in a real environment, which is failure-prone, remains a challenge. To address this issue, failure detectors are used in distributed systems, which are abstract modules responsible for detecting and monitoring the activity of nodes in order to determine whether they are faulty or not. In this paper, an approach is presented for the systematic literature review of failure detectors in distributed systems. Further, many existing review and survey papers on failure detectors are critically analyzed along with their key contributions and limitations. The classification of distributed systems is presented on the basis of the nodes' properties and the components of system models are described in detail. Various issues and challenges related to agreement and failure problems are also explored. The strengths and limitations of various existing failure detectors are discussed along with their comparative evaluation. Finally, fault-tolerance and recovery techniques are discussed and analyzed.
computer science, information systems, telecommunications, engineering, electrical & electronic, hardware & architecture