Abstract: Today’s high performance computing systems are made possible by multiple increases in hardware parallelism. As a result, the mean time to failure of these systems decreases with each new generation, which is an alarming trend. It is therefore not surprising that fault tolerance and fault mitigation are active research areas. Applications should survive a failure and/or be able to recover at minimal cost. We use the Global Address Space Programming Interface (GASPI), a relatively new communication library based on the PGAS model. It fulfills the basic requirement of a fault tolerant communication library, i.e. the failure of a process does not cause the remaining processes to fail. This work focuses on extending the fault tolerance features of GASPI in the form of a supporting health-check library that applications can benefit from. These features include failure detection, propagation of failure information, recovery management, and communication recovery. To reinforce its utility, we have also developed a fault tolerant neighbor node-level checkpoint/restart library. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how these supplementary fault tolerance functions can be used to build applications that integrate a low cost fault detection/recovery mechanism and, if necessary, recover on the fly. We showcase the usage of these tools by implementing them in three different applications. Two of the applications are sparse linear solvers, whereas the third is based on a fluid flow solver. We also analyze the overheads involved in failure-free cases as well as in various failure cases. Our fault detection mechanism causes no overhead in failure-free cases, whereas in case of failure(s), the failure detection and recovery cost is acceptable and scales well.
