GLAM-SERP: Building a Graph Learning-Assisted Model for Soft Error Resilience Prediction in GPGPUs.

Xiaohui Wei, Jianpeng Zhao, Nan Jiang, Hengshan Yue
DOI: https://doi.org/10.1007/978-981-97-0859-8_25
2024-01-01
Abstract: Due to their efficient data-parallel computing capabilities, General-Purpose Graphics Processing Units (GPGPUs) have become increasingly prevalent in deep learning and scientific computing. As chip integration density grows, GPGPUs are becoming more susceptible to soft errors, which can cause catastrophic results in safety-critical systems. Consequently, analyzing the error resilience of GPGPU programs is essential to guide reliability enhancement. Unfortunately, traditional analysis methods such as Statistical Fault Injection (SFI) suffer from colossal time overhead, while machine learning-based resilience prediction methods are limited in their ability to characterize program error propagation behavior. To address these challenges, we propose GLAM-SERP, a Graph Learning-Assisted Model for Soft Error Resilience Prediction in GPGPUs. Our critical insight is that the error resilience of GPGPU instructions is related to both their inherent properties and their error propagation characteristics. Thus, we construct a Dependency Graph (DG) for the GPGPU program, where nodes represent individual GPGPU instructions, node features capture the resilience characteristics of each instruction, and edges depict the error propagation pathways between instructions. Based on the established DG, we then apply a Graph Attention Network (GAT) to predict the Silent Data Corruption (SDC) proneness of GPGPU instructions under soft errors. The experimental results demonstrate that our approach achieves an average prediction performance of 94.14% on individual programs and 89.50% on unseen programs, showcasing its accurate and general error resilience prediction capability in GPGPUs.
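The dependency-graph idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy instruction list, the register-based def-use rule for edges, the two-element node features, and the helper names `build_dependency_graph` and `attention_weights` are all assumptions made for illustration. The attention step follows the standard single-head GAT formulation (LeakyReLU score followed by a softmax over neighbors), simplified here with an identity projection in place of a learned weight matrix.

```python
import math

# Hypothetical toy program: (opcode, destination register, source registers).
# These instructions are illustrative only and are not taken from the paper.
PROGRAM = [
    ("ld",  "r1", []),            # 0
    ("ld",  "r2", []),            # 1
    ("add", "r3", ["r1", "r2"]),  # 2
    ("mul", "r4", ["r3", "r1"]),  # 3
    ("st",  None, ["r4"]),        # 4
]

def build_dependency_graph(program):
    """Return def-use edges (producer index, consumer index).

    Assumption: an error propagates from the latest writer of a register
    to every later instruction that reads that register.
    """
    last_writer = {}  # register name -> index of the latest instruction writing it
    edges = []
    for idx, (_, dest, srcs) in enumerate(program):
        for reg in srcs:
            if reg in last_writer:
                edge = (last_writer[reg], idx)
                if edge not in edges:
                    edges.append(edge)
        if dest is not None:
            last_writer[dest] = idx
    return edges

def attention_weights(node, neighbors, features, a_src, a_dst):
    """Single-head GAT-style attention of `node` over `neighbors`.

    Assumption: identity projection, so raw features are scored directly;
    a_src and a_dst are the two halves of the attention vector.
    """
    def leaky_relu(x, slope=0.2):
        return x if x > 0 else slope * x

    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    scores = [
        leaky_relu(dot(a_src, features[node]) + dot(a_dst, features[n]))
        for n in neighbors
    ]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

EDGES = build_dependency_graph(PROGRAM)

# Hypothetical node features: [is_memory_op, is_arithmetic_op].
FEATURES = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.0, 1.0], 4: [1.0, 0.0]}

# Producers feeding instruction 3, i.e. where an error reaching it could originate.
PREDECESSORS_OF_3 = [src for src, dst in EDGES if dst == 3]
WEIGHTS = attention_weights(3, PREDECESSORS_OF_3, FEATURES, [0.5, -0.5], [1.0, 0.3])
```

In this sketch the attention weights indicate how strongly each producer instruction contributes to the representation of instruction 3. A full model would learn the projection and attention parameters from fault-injection labels and stack several such layers before classifying each instruction as SDC-prone or not.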