Near-Zero Downtime Recovery From Transient-Error-Induced Crashes
Chao Chen, Greg Eisenhauer, Santosh Pande
DOI: https://doi.org/10.1109/tpds.2021.3096055
IF: 5.3
2022-04-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: As systems scale, transient errors caused by external noise, e.g., heat fluxes and particle strikes, have become a growing concern for current and upcoming exa-scale high-performance-computing (HPC) systems. Applications running on these systems are expected to experience transient errors more frequently than ever before, which will either lead them to generate incorrect outputs or cause them to crash. However, since such errors are still quite rare compared to no-fault cases, desirable solutions call for low/no-overhead systems that do not compromise performance under no-fault conditions and also allow very fast fault recovery to minimize downtime. In this article, we present IterPro, a light-weight compiler-assisted resilience technique to quickly and accurately recover processes from transient-error-induced crashes. During the compilation of applications, IterPro constructs a set of recovery kernels for crash-prone instructions. These recovery kernels are executed to repair the corrupted process states on-the-fly when errors occur, enabling applications to continue their execution instead of being terminated. When constructing recovery kernels, IterPro exploits side effects introduced by induction-variable code optimizations based on loop unrolling and strength reduction to improve its recovery capability. To this end, two new code transformation passes are introduced to expose the side effects for resilience purposes. We evaluated IterPro with four scientific workloads as well as the NPB benchmark suite. During their normal execution, IterPro incurs almost zero runtime overhead and a small, fixed 27 MB memory overhead. Meanwhile, IterPro can recover on average 83.55 percent of crash-causing errors within dozens of milliseconds with negligible downtime.
We also evaluated IterPro with parallel jobs running on 3072 cores and showed that IterPro can successfully mask the impact of crash-causing errors by providing almost uninterrupted execution. Finally, we present our preliminary evaluation results for BLAS, which show that IterPro is capable of recovering from failures in libraries with a very high coverage rate of 83 percent and negligible overheads. With such an effective recovery mechanism, IterPro could tremendously mitigate the overheads and resource requirements of the resilience subsystem in future exa-scale systems.
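The recovery idea described in the abstract can be illustrated with a toy sketch. This is not IterPro's implementation, which works at compile time on crash-prone machine instructions. It only assumes that strength reduction leaves two related induction variables in a loop (a counter `i` and an offset advanced by `stride`), so that a corrupted offset can be rebuilt from the counter. The function name `strength_reduced_sum`, the `corrupt_at` parameter, and the use of `IndexError` as a stand-in for a crash are all hypothetical.

```python
def strength_reduced_sum(data, stride, corrupt_at=None):
    """Sum data[0], data[stride], ... using a strength-reduced offset.

    The offset is advanced by addition rather than recomputed as i * stride.
    The loop counter i is therefore redundant information from which a
    corrupted offset can be rebuilt (a toy "recovery kernel").
    """
    total = 0
    recoveries = 0
    offset = 0
    count = (len(data) + stride - 1) // stride
    for i in range(count):
        if i == corrupt_at:
            offset ^= 1 << 40  # simulated bit flip in the offset (assumption)
        try:
            value = data[offset]
        except IndexError:
            # Stand-in for a crash: rebuild the offset from the counter,
            # then retry the access instead of terminating.
            offset = i * stride
            value = data[offset]
            recoveries += 1
        total += value
        offset += stride
    return total, recoveries


if __name__ == "__main__":
    values = list(range(10))
    print(strength_reduced_sum(values, 3))                # fault-free run
    print(strength_reduced_sum(values, 3, corrupt_at=2))  # one repaired fault
```

In this sketch the fault-free path pays only for an untriggered `try` block, which loosely mirrors the goal of near-zero overhead when no error occurs. A bit flip that happens to produce an in-range offset would go unnoticed here, so the sketch covers only crash-causing faults.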
computer science, theory & methods; engineering, electrical & electronic