Abstract: Due to system scaling, transient errors caused by external noise, e.g., heat fluxes and particle strikes, have become a growing concern for current and upcoming exa-scale high-performance-computing (HPC) systems. Applications running on these systems are expected to experience transient errors more frequently than ever before, which will either lead them to generate incorrect outputs or cause them to crash. However, since such errors are still quite rare compared to no-fault cases, desirable solutions call for low/no-overhead systems that do not compromise performance under no-fault conditions and also allow very fast fault recovery to minimize downtime. In this article, we present IterPro, a light-weight compiler-assisted resilience technique to quickly and accurately recover processes from transient-error-induced crashes. During the compilation of applications, IterPro constructs a set of recovery kernels for crash-prone instructions. These recovery kernels are executed to repair corrupted process states on-the-fly when errors occur, enabling applications to continue executing instead of being terminated. When constructing recovery kernels, IterPro exploits side effects introduced by induction-variable-based code optimizations built on loop unrolling and strength reduction to improve its recovery capability. To this end, two new code transformation passes are introduced to expose these side effects for resilience purposes. We evaluated IterPro with four scientific workloads as well as the NPB benchmark suite. During their normal execution, IterPro incurs almost zero runtime overhead and a small, fixed 27MB memory overhead. Meanwhile, IterPro can recover on average 83.55 percent of crash-causing errors within dozens of milliseconds with negligible downtime.
We also evaluated IterPro with parallel jobs running on 3072 cores and showed that IterPro can successfully mask the impact of crash-causing errors by providing almost uninterrupted execution. Finally, we present our preliminary evaluation results for BLAS, which show that IterPro is capable of recovering from failures in libraries with a very high coverage rate of 83 percent and negligible overheads. With such an effective recovery mechanism, IterPro could tremendously mitigate the overheads and resource requirements of the resilience subsystem in future exa-scale systems.
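
The recovery idea described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not IterPro's implementation: IterPro operates at compile time on crash-prone machine instructions, whereas here a Python `IndexError` stands in for a crash, and the names `flip_bit`, `recovery_kernel`, and `strided_sum` are hypothetical. The loop keeps a strength-reduced offset that is advanced by a fixed stride alongside the loop counter; because the counter still holds enough information to recompute the offset, a corrupted offset can be repaired and the loop can continue.

```python
def flip_bit(value, bit):
    """Simulate a transient error by flipping one bit of an integer."""
    return value ^ (1 << bit)


def recovery_kernel(base, stride, i):
    """Hypothetical recovery kernel: recompute the strength-reduced
    offset from the loop counter, which is assumed to be uncorrupted."""
    return base + i * stride


def strided_sum(data, base, stride, count, fault_at=None, fault_bit=20):
    """Sum data[base + i*stride] for i in range(count).

    The offset is kept as a strength-reduced variable (incremented by
    stride each iteration). If fault_at is given, a bit flip is injected
    into the offset at that iteration.
    """
    total = 0
    recoveries = 0
    offset = base  # redundant with i: offset == base + i * stride
    for i in range(count):
        if i == fault_at:
            offset = flip_bit(offset, fault_bit)  # injected transient error
        try:
            value = data[offset]
        except IndexError:
            # Stand-in for a crash: repair the state and retry the access.
            offset = recovery_kernel(base, stride, i)
            value = data[offset]
            recoveries += 1
        total += value
        offset += stride
    return total, recoveries


data = list(range(100))
print(strided_sum(data, 2, 3, 10))              # fault-free run
print(strided_sum(data, 2, 3, 10, fault_at=4))  # run with one injected fault
```

The sketch only handles corruptions that cause an out-of-range access. A bit flip that leaves the offset inside the array would silently read the wrong element, which corresponds to the incorrect-output case rather than the crash case that the abstract targets.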

HPC-Crash: Characterizing Crash-Proneness of HPC Programs from Various Perspectives

Automatic Identification of Crash-inducing Smart Contracts

HPC AI500: A Benchmark Suite for HPC AI Systems

Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC

HPCC: a Memory Access Model Oriented Benchmark--a Potential Test Method to Replace HPL in TOP500

Automatically Assessing Crashes from Heap Overflows

CRAC: an Automatic Assistant Compiler of Checkpoint/restart for OpenCL Program

Application-Level Resilience Modeling for HPC Fault Tolerance

Indexing Noncrashing Failures: A Dynamic Program Slicing-Based Approach

Job Failures in High Performance Computing Systems: A Large-Scale Empirical Study

FlipTracker: Understanding Natural Error Resilience in HPC Applications

Power Profile Monitoring and Tracking Evolution of System-Wide HPC Workloads

Near-Zero Downtime Recovery From Transient-Error-Induced Crashes

Postmortem Program Analysis with Hardware-Enhanced Post-Crash Artifacts

A Review of HPC applications in High-Speed Rail Systems

Igor: Crash Deduplication Through Root-Cause Clustering

A Case Study of Designing Efficient Algorithm-based Fault Tolerant Application for Exascale Parallelism

FT-Offload: A Scalable Fault-Tolerance Programing Model on MIC Cluster

Failure Analysis and Quantification for Contemporary and Future Supercomputers

A Visual Comparison of Silent Error Propagation