Evaluating the Soft Error Resilience of Instructions for GPU Applications

Xiaohui Wei, Ruyu Zhang, Yuanyuan Liu, Hengshan Yue, Jingweijia Tan
DOI: https://doi.org/10.1109/cse/euc.2019.00091
2019-01-01
Abstract: Graphics Processing Units (GPUs) are widely used across High Performance Computing fields because of their high parallelism. As technology scales down, GPUs become more susceptible to soft errors, which dramatically affect the output quality of applications. Silent Data Corruption (SDC) is one of the most concerning reliability issues, and eliminating it requires efficient protection mechanisms. Software-directed instruction replication is a flexible technique for mitigating SDCs. However, this method requires a trade-off between reliability and overhead. To this end, it is imperative to explore the SDC criticality of individual instructions. In this paper, we carry out a fine-grained analysis of the instruction-level error behavior of 11 benchmarks, whereas previous work focused on the error resilience of entire applications. Combining the error resilience of instructions with the dynamic data flow of applications, we identify potential protection opportunities for the instructions.