DAW-DMR: Divergence-Aware Warped DMR with Full Error Detection for GPGPU S

Yukun Wei, Mingyu Wang, Haiqiu Huang, Wangguang Wang, Zhiyi Yu
DOI: https://doi.org/10.1109/isvlsi61997.2024.00039
2024-01-01
Abstract: General-purpose graphics processing units (GPGPUs) have emerged as a pivotal computing platform for scientific applications requiring high performance. However, reliability has become as important as performance. Previous research on error detection for GPGPUs has shown a relatively high performance penalty and low error coverage. To address these challenges, this work proposes the Divergence-Aware Warped Dual Modular Redundancy (DAW-DMR) architecture, which fully exploits underutilized parallelism in GPGPUs to ensure both full error detection and an ultra-low performance penalty. A divergence-based Adaptive Redundancy mechanism captures runtime information about the level of branch divergence in warps and flexibly switches between two redundancy architectures proposed for different dimensions. Within a warp, Intra-Warp Redundancy uses spatial redundancy to keep the performance penalty low. Across different warps, fine-grained Inter-Warp Redundancy with a specialized architecture uses temporal redundancy to achieve full error coverage. Experimental evaluations demonstrate that the proposed DAW-DMR architecture achieves an average performance improvement of over 90% and 15.6% compared to R-Naive and Warped-DMR, respectively, with a negligible performance penalty of 1.7%. Furthermore, it attains a mean error coverage improvement of 16% compared to previous research, ensuring 100% error detection with minimal overheads of 1.5% in area and 1.7% in power consumption.
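The switching idea described in the abstract can be illustrated with a minimal software sketch. This is not the authors' hardware design: the abstract does not state the switching rule, so the threshold below (use spatial redundancy when at least as many lanes are idle as active, otherwise fall back to temporal redundancy), the warp width of 32, and the names `choose_redundancy` and `dmr_mismatches` are all assumptions made for illustration.

```python
WARP_SIZE = 32  # assumed warp width; the abstract does not specify one


def choose_redundancy(active_mask: int, warp_size: int = WARP_SIZE) -> str:
    """Pick a redundancy mode from a warp's active-lane bitmask.

    Hypothetical policy, not taken from the paper: a divergent warp with
    enough idle lanes re-executes each active thread on an idle lane
    (spatial, intra-warp); otherwise the instruction is re-executed later
    in another issue slot (temporal, inter-warp).
    """
    mask = active_mask & ((1 << warp_size) - 1)  # ignore bits beyond the warp
    active = bin(mask).count("1")
    idle = warp_size - active
    if active == 0:
        return "none"  # nothing executed, nothing to verify
    if idle >= active:
        return "intra-warp"  # every active lane can be paired with an idle lane
    return "inter-warp"


def dmr_mismatches(primary, redundant):
    """Return the lane indices whose original and redundant results differ."""
    return [i for i, (p, r) in enumerate(zip(primary, redundant)) if p != r]
```

Under this assumed rule, a fully active warp (mask `0xFFFFFFFF`) has no idle lanes and is routed to inter-warp redundancy, while a warp with half of its lanes masked off by a branch (mask `0x0000FFFF`) can be checked within the warp. `dmr_mismatches` shows the comparison step common to both modes: any differing lane signals a detected error.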