Investigating the impact of transient hardware faults on deep learning neural network inference
Md Hasanur Rahman,Sabuj Laskar,Guanpeng Li
DOI: https://doi.org/10.1002/stvr.1873
2024-02-03
Software Testing Verification and Reliability
Abstract: This study investigates the impact of transient hardware faults and algorithmic inaccuracies on DNN misclassifications in terms of safety‐critical behavior. Specifically, it offers a more comprehensive understanding of the multifaceted factors that influence the likelihood of safety‐critical misclassifications across different DNN models. Our findings highlight that transient hardware faults pose a greater risk of causing safety‐critical misclassifications than intrinsic algorithmic inaccuracies.
Summary: Safety‐critical applications, such as autonomous vehicles, healthcare, and space applications, have witnessed widespread deployment of deep neural networks (DNNs). Inherent algorithmic inaccuracies have consistently been a prevalent cause of misclassifications, even in modern DNNs. Simultaneously, with the ongoing effort to minimize the footprint of contemporary chip designs, the likelihood of transient hardware faults in deployed DNN models continues to rise. Consequently, researchers have asked to what extent these faults contribute to DNN misclassifications compared to algorithmic inaccuracies. This article examines the impact of DNN misclassifications caused by transient hardware faults and by intrinsic algorithmic inaccuracies in safety‐critical applications. First, we enhance a cutting‐edge fault injector for TensorFlow applications, TensorFI, to support scalable fault injection on modern non‐sequential DNN models. Next, we analyse the DNN‐inferred outcomes against our defined safety‐critical metrics. Finally, we conduct extensive fault injection experiments and a comprehensive analysis to achieve the following objectives: (1) investigate the impact of different target class groupings on DNN failures and (2) pinpoint the most vulnerable bit locations within tensors, as well as the DNN layers accountable for the majority of safety‐critical misclassifications.
Our findings regarding different grouping formations reveal that failures induced by transient hardware faults can have a substantially greater impact (with a probability up to 4× higher) on safety‐critical applications compared to those resulting from algorithmic inaccuracies. Additionally, our investigation demonstrates that higher order bit positions in tensors, as well as initial and final layers of DNNs, necessitate prioritized protection compared to other regions.
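To illustrate why higher order bit positions matter, the following minimal sketch flips one bit of a single float32 value and prints the result. It is not TensorFI code and not the authors' experimental setup: the helper name `flip_bit` is hypothetical, and both the single‐bit‐flip fault model and the float32 tensor format are assumptions made for illustration only.

```python
import numpy as np


def flip_bit(value: float, bit: int) -> float:
    """Return `value` stored as float32 with one bit inverted.

    Hypothetical helper for illustration; bit 0 is the least significant
    mantissa bit, bits 23-30 hold the exponent, and bit 31 is the sign.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 storage as an unsigned 32-bit integer.
    raw = np.array([value], dtype=np.float32).view(np.uint32)
    # Invert the selected bit in place.
    raw ^= np.uint32(1 << bit)
    # Reinterpret the modified bits as float32 again.
    return float(raw.view(np.float32)[0])


# Compare a low-order mantissa flip with exponent and sign flips.
for bit in (0, 22, 30, 31):
    print(f"bit {bit:2d}: 0.5 -> {flip_bit(0.5, bit)!r}")
```

Under these assumptions, a flip in a low mantissa bit changes the value only slightly, whereas a flip in the top exponent bit changes its magnitude by many orders, which is consistent with the finding that higher order bits warrant prioritized protection.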
computer science, software engineering