Application-Level Resilience Modeling for HPC Fault Tolerance

Luanzheng Guo, Hanlin He, Dong Li
DOI: https://doi.org/10.48550/arXiv.1705.00267
2017-04-30
Distributed, Parallel, and Cluster Computing
Abstract: Understanding application resilience in the presence of faults is critical to addressing the HPC resilience challenge. Currently, we rely largely on random fault injection (RFI) to quantify application resilience. However, RFI provides little information on how fault tolerance happens, and its results are often nondeterministic because of its random nature. In this paper, we introduce a new methodology to quantify application resilience. Our methodology is based on the observation that, at the application level, resilience to faults comes from application-level fault masking, which happens because of application-inherent semantics and program constructs. Based on this observation, we analyze application execution information and use a data-oriented approach to model application resilience. We use our model to study how and why HPC applications can (or cannot) tolerate faults, and we demonstrate tangible benefits of using the model to direct fault tolerance mechanisms.
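To make the idea of application-level fault masking concrete, here is a minimal Python sketch. It is an illustration only and is not the paper's model: the toy `kernel`, the helpers `flip_bit` and `masking_profile`, and the input values are all hypothetical. Instead of sampling faults at random, it flips every bit of every input element once, so the result is deterministic, and it counts how many single-bit faults leave the output unchanged.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant, 63 = sign) of a double's IEEE-754 encoding."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def kernel(data):
    """Toy computation (hypothetical) with two masking-friendly constructs."""
    buf = list(data)
    buf[0] = 1.0  # overwrite: any fault already present in buf[0] is discarded
    # comparison: a corrupted value is masked whenever the branch outcome is unchanged
    return sum(1 for x in buf if x > 0.5)


def masking_profile(data):
    """For each element, count the single-bit faults (out of 64) that the kernel masks."""
    golden = kernel(data)  # fault-free reference output
    profile = []
    for i in range(len(data)):
        masked = 0
        for bit in range(64):
            faulty = list(data)
            faulty[i] = flip_bit(data[i], bit)
            if kernel(faulty) == golden:
                masked += 1
        profile.append(masked)
    return profile


if __name__ == "__main__":
    for index, masked in enumerate(masking_profile([0.25, 0.75, 2.0])):
        print(f"element {index}: {masked}/64 single-bit faults masked")
```

In this sketch every fault in element 0 is masked because the value is overwritten before it is read, while faults in the other elements are masked only when the comparison still takes the same branch. Attributing masking to individual data elements in this way is meant to echo the data-oriented view described in the abstract, under the assumptions stated above.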