Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units

Juan-David Guerrero-Balaguera, Josie E. Rodriguez Condia, Fernando F. dos Santos, Matteo Sonza, Paolo Rech

2023-10-03

Abstract: Graphics Processing Units (GPUs) are over-stressed to accelerate High-Performance Computing applications and are used to accelerate Deep Neural Networks in several domains where they have a life expectancy of many years. These conditions expose the GPU hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufacturing test. Techniques to assess the impact of permanent faults in GPUs are therefore strongly required, making it possible to estimate the reliability risk and possibly mitigate it. In this paper, we present a method to evaluate the effects of permanent faults affecting the GPU scheduler and control units, which are the most peculiar and stressed resources, along with the first figures that allow quantifying these effects. We characterize over 5.83x10^5 permanent fault effects in the scheduler and controllers of a gate-level GPU model. Then, we map the observed error categories in software by instrumenting the code of 13 applications and two convolutional neural networks, injecting more than 1.65x10^5 permanent errors. Our two-level fault injection strategy reduces the evaluation time from hundreds of years of gate-level evaluation to hundreds of hours. We found that faults in the GPU parallelism management units can modify the opcode, the addresses, and the status of thread(s) and warp(s). The large majority (up to 99%) of these hardware permanent errors impact the running software execution. Errors affecting the instruction operation or resource management hang the code, while 45% of errors in the parallelism management or control-flow induce silent data corruptions.
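To make the idea of a permanent error in the status of threads more concrete, here is a minimal sketch in Python. It is not the authors' framework: the warp width, the `warp_vector_add` and `corrupted_lanes` helpers, and the choice of a single stuck-at bit in the active-thread mask are all assumptions made for illustration. It emulates one warp computing an element-wise sum and shows how a lane that is permanently disabled can silently corrupt the output.

```python
WARP_SIZE = 32                      # assumed warp width
FULL_MASK = (1 << WARP_SIZE) - 1    # all lanes active


def stuck_at(value, bit, level):
    """Force one bit of `value` to `level` (0 or 1): a stuck-at fault model."""
    mask = 1 << bit
    return value | mask if level else value & ~mask


def warp_vector_add(a, b, active_mask, fault=None):
    """Emulate one warp computing c[i] = a[i] + b[i] on its active lanes.

    `fault` is a hypothetical (bit, level) pair applied to the active mask
    every time the warp runs, which is what makes the error permanent.
    """
    if fault is not None:
        active_mask = stuck_at(active_mask, *fault)
    c = [0] * WARP_SIZE
    for lane in range(WARP_SIZE):
        if (active_mask >> lane) & 1:
            c[lane] = a[lane] + b[lane]
    return c


def corrupted_lanes(golden, faulty):
    """Return the lanes whose result differs from the fault-free run."""
    return [i for i, (g, f) in enumerate(zip(golden, faulty)) if g != f]
```

Running the warp once without a fault and once with lane 5 stuck at 0 yields an output that differs only in lane 5, with no crash or hang to signal the problem, which is the behaviour the abstract calls a silent data corruption. A stuck-at-1 fault on the same bit is masked here, because that lane was already active.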

Hardware Architecture

What problem does this paper attempt to address?

This paper addresses how to evaluate the impact of permanent faults in the parallelism management and control units of GPUs (Graphics Processing Units). Specifically, the authors focus on:

1. **Evaluation of Permanent Faults**: Modern GPUs are heavily stressed and expected to stay in service for many years, which exposes the hardware to aging and to permanent faults that appear after the end-of-manufacturing test. Evaluating the impact of these faults is therefore important, especially in critical application areas.
2. **Impact of Faults on Parallelism Management and Control Units**: The paper proposes a method to evaluate the impact of permanent faults in GPU schedulers, instruction fetch units, and decode units, and provides the first figures quantifying these impacts. By injecting a large number of permanent faults into a gate-level GPU model, the authors observed that these faults can modify the opcode, the addresses, and the status of threads and warps, and that the resulting errors in resource management or in the instruction operation hang the code, while others cause silent data corruption.
3. **Fault Injection and Evaluation Method**: To evaluate these permanent faults efficiently, the authors adopted a two-level approach that combines accurate gate-level fault simulation with flexible software-level error injection. The error categories observed at gate level are mapped into software by instrumenting application code, which reduces the evaluation time from hundreds of years of gate-level evaluation to hundreds of hours.

In summary, the main goal of this paper is to provide a method to evaluate and understand the impact of permanent faults in the GPU parallelism management and control units, thereby supporting the reliability and safety of GPUs during long-term use.
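The software-level half of such a flow can be illustrated with a minimal sketch in Python. Everything here is hypothetical and chosen only for illustration: the toy 2-bit opcode encoding, the `run` and `classify` helpers, and the three outcome labels. It is not the authors' tool and its outcomes are not the paper's results. It applies a stuck-at error to the opcode on every decode, compares the run against a fault-free one, and labels the result as masked, silent data corruption (SDC), or detected unrecoverable error (DUE, standing in for a hang or crash).

```python
from enum import Enum


class Outcome(Enum):
    MASKED = "masked"   # output identical to the fault-free run
    SDC = "sdc"         # run completes with a wrong output
    DUE = "due"         # run aborts (stands in for a hang or crash)


# Hypothetical 2-bit opcode encoding of a toy ISA; opcode 3 is undefined.
OPS = {
    0: lambda x, k: x + k,   # ADD
    1: lambda x, k: x - k,   # SUB
    2: lambda x, k: x * k,   # MUL
}


def stuck_at(value, bit, level):
    """Force one bit of `value` to `level` (0 or 1)."""
    mask = 1 << bit
    return value | mask if level else value & ~mask


def run(program, inputs, fault=None):
    """Run `program` (a list of (opcode, constant)) on each input value.

    `fault` is a (bit, level) pair applied to every decoded opcode,
    which models a permanent rather than a transient error.
    """
    outputs = []
    for x in inputs:
        for opcode, k in program:
            if fault is not None:
                opcode = stuck_at(opcode, *fault)
            if opcode not in OPS:
                raise RuntimeError("illegal opcode")
            x = OPS[opcode](x, k)
        outputs.append(x)
    return outputs


def classify(program, inputs, fault):
    """Compare a faulty run against the golden run and label the outcome."""
    golden = run(program, inputs)
    try:
        faulty = run(program, inputs, fault)
    except RuntimeError:
        return Outcome.DUE
    return Outcome.MASKED if faulty == golden else Outcome.SDC
```

For the program `[(0, 3), (2, 2)]` (add 3, then multiply by 2), a stuck-at-1 on opcode bit 0 turns MUL into the undefined opcode and is classified as DUE, a stuck-at-1 on bit 1 turns ADD into MUL and yields an SDC, and a stuck-at-0 on bit 0 leaves both opcodes unchanged and is masked. A real campaign would repeat this classification over many fault sites and applications to build outcome statistics.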

Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs

Evaluating the Soft Error Resilience of Graph Applications on GPGPUs

Comparative analysis of soft-error sensitivity in LU decomposition algorithms on diverse GPUs

Evaluating the Soft Error Resilience of Instructions for GPU Applications

Can GPU performance increase faster than the code error rate?

Algorithmic Strategies for Sustainable Reuse of Neural Network Accelerators with Permanent Faults

Assessing the Impact of Compiler Optimizations on GPUs Reliability

Enabling predictable parallelism in single-GPU systems with persistent CUDA threads

G-SEPM: building an accurate and efficient soft error prediction model for GPGPUs

Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

Characterizing a Neutron-Induced Fault Model for Deep Neural Networks

Adaptive Multidimensional Parallel Fault Simulation Framework on Heterogeneous System

A Spatially Correlated Competing Risks Time-to-Event Model for Supercomputer GPU Failure Data

Mitigating the Impact of Hardware Variability for GPGPUs Register File

Optimizing Non-Coalesced Memory Access for Irregular Applications with GPU Computing

Batch-Aware Unified Memory Management in GPUs for Irregular Workloads

Prediction of GPU Failures Under Deep Learning Workloads

GRAP: Efficient GPU-Based Redundancy Analysis Using Parallel Evaluation for Cross Faults

Characterizing the Execution Dynamics of GPGPU Applications