HPC-Crash: Characterizing Crash-Proneness of HPC Programs from Various Perspectives

Xiaohui Wei, Shiyu Tong, Zhongao Sun, Fengyi Li, Xiang Li, Hengshan Yue
DOI: https://doi.org/10.1109/hpsc62738.2024.00023
2024-01-01
Abstract: High-Performance Computing (HPC) systems are widely used to execute large-scale, complex applications. However, as more cores are integrated, HPC systems become more susceptible to soft errors, which can cause crashes. The Checkpoint and Recovery (CR) mechanism is the standard solution for recovering from crashes, but it incurs significant overhead. Although many works have been proposed to minimize the overhead of CR mechanisms, we observe that they may ignore the application's inherent error-resilience features, thereby missing opportunities for further CR optimization. Therefore, this paper proposes HPC-Crash, a framework that characterizes the crash behavior of HPC benchmarks at fine granularity from various perspectives. By leveraging HPC-Crash to perform exhaustive statistical analysis of 5 HPC benchmarks, we observe that instruction type, execution region, and application workflow are all essential factors in the crash-proneness of HPC programs. The characterization results of HPC-Crash provide essential insight for HPC programmers designing more cost-effective CR solutions.