Evaluating the Soft Error Resilience of Graph Applications on GPGPUs

Xiaohui Wei, Mengting Zhou, Nan Jiang, Hengshan Yue
DOI: https://doi.org/10.1109/bigdatasecurity62737.2024.00022
2024-01-01
Abstract: General-Purpose Graphics Processing Units (GPGPUs) are widely used for graph processing thanks to their high throughput, massive parallelism, and powerful computing capacity. However, with increasing levels of integration, GPGPUs are susceptible to soft errors, which can undermine the reliability of the graph applications they accelerate. Typically, fault tolerance strategies such as thread replication and checkpointing are applied to ensure reliable program execution. However, these techniques require an appropriate trade-off between reliability improvement and overhead. Making that trade-off requires an understanding of the error resilience profile of graph applications. In this paper, we construct a multidimensional analysis framework that analyzes the error resilience of graph applications on GPGPUs from the application, kernel-function, and thread perspectives. Based on exhaustive statistical fault injection (FI) results for four graph algorithms, we propose heuristic suggestions to guide the design of efficient fault tolerance strategies.