Detecting SDCs in GPGPUs Through an Efficient Instruction Duplication Mechanism

Xiaohui Wei, Nan Jiang, Xiaonan Wang, Hengshan Yue
DOI: https://doi.org/10.1007/978-3-030-82153-1_47
2021-01-01
Abstract: As General-Purpose Graphics Processing Units (GPGPUs) are widely used in High-Performance Computing (HPC) applications, their vulnerability to soft errors has become a critical concern. In this paper, we propose an efficient instruction duplication mechanism that duplicates only the instructions vulnerable to silent data corruption (SDC), reducing the overhead of reliability protection. We first observe that the SDC proneness of an individual instruction is related to its instruction type, its fault propagation, and whether it affects shared memory. Leveraging these factors, we then use machine learning to identify the SDC-vulnerable instructions of GPU applications and protect them efficiently. Experimental results show that our method achieves 90.45% SDC coverage while duplicating only 37.8% of static instructions, a significant improvement in both performance and SDC detection capability over the state-of-the-art duplication technique for GPUs.
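The duplicate-and-compare idea behind instruction duplication can be illustrated with a minimal Python sketch. This is not the paper's implementation, which operates on GPU instructions and uses a machine-learning model to choose what to protect. Here the `protect` flag stands in for that selection decision, a single bit flip models a soft error, and all names (`run`, `flip_bit`, `SDCDetected`) are hypothetical.

```python
from typing import Optional


class SDCDetected(Exception):
    """Raised when the original and duplicated results disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Model a soft error as a single bit flip in an integer result."""
    return value ^ (1 << bit)


def run(a: int, b: int, protect: bool, fault_bit: Optional[int] = None) -> int:
    """Compute a * b + 1, optionally duplicating the multiply.

    `protect` is an assumed stand-in for "this instruction was selected
    for duplication"; `fault_bit` injects a fault into the original copy.
    """
    prod = a * b                      # original instruction
    if fault_bit is not None:
        prod = flip_bit(prod, fault_bit)

    if protect:
        shadow = a * b                # duplicated instruction
        if shadow != prod:            # comparison inserted after the pair
            raise SDCDetected(f"expected {shadow}, got {prod}")

    return prod + 1


if __name__ == "__main__":
    print(run(3, 4, protect=False))               # fault-free result
    print(run(3, 4, protect=False, fault_bit=0))  # wrong result, no warning
    try:
        run(3, 4, protect=True, fault_bit=0)
    except SDCDetected as err:
        print("detected:", err)
```

Without protection, the injected fault silently changes the output. With protection, the mismatch is caught at the cost of one extra multiply and one comparison, which is why duplicating only the vulnerable instructions reduces overhead.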