DeepVigor+: Scalable and Accurate Semi-Analytical Fault Resilience Analysis for Deep Neural Network

Mohammad Hasan Ahmadilivani, Jaan Raik, Masoud Daneshtalab, Maksim Jenihhin

2024-10-21

Abstract: Growing exploitation of Machine Learning (ML) in safety-critical applications necessitates rigorous safety analysis. Hardware reliability assessment is a major concern with respect to measuring the level of safety. Quantifying the reliability of emerging ML models, including Deep Neural Networks (DNNs), is highly complex due to their enormous size in terms of the number of parameters and computations. Conventionally, Fault Injection (FI) is applied to perform a reliability measurement. However, performing FI on modern-day DNNs is prohibitively time-consuming if an acceptable confidence level is to be achieved. In order to speed up FI for large DNNs, statistical FI has been proposed. However, the run-time for the large DNN models is still considerably long. In this work, we introduce DeepVigor+, a scalable, fast and accurate semi-analytical method as an efficient alternative for reliability measurement in DNNs. DeepVigor+ implements a fault propagation analysis model and attempts to acquire Vulnerability Factors (VFs) as reliability metrics in an optimal way. The results indicate that DeepVigor+ obtains VFs for DNN models with an error less than 1% and 14.9 up to 26.9 times fewer simulations than the best-known state-of-the-art statistical FI enabling an accurate reliability analysis for emerging DNNs within a few minutes.
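To give a sense of why statistical FI still needs many simulations, the sketch below computes a sample size with the finite-population formula commonly used in the statistical fault injection literature. The abstract does not say which formula or parameters the baseline uses, so the function name, the default error margin, the confidence constant `t`, and the worst-case `p = 0.5` are all illustrative assumptions.

```python
import math


def statistical_fi_sample_size(population, error_margin=0.01, t=2.576, p=0.5):
    """Number of faults to inject for a target error margin and confidence.

    Assumed formula (finite-population sampling):
        n = N / (1 + e^2 * (N - 1) / (t^2 * p * (1 - p)))

    population   -- N, total number of possible faults (e.g. bits x locations)
    error_margin -- e, tolerated estimation error (0.01 = 1%)
    t            -- normal quantile for the confidence level (2.576 ~ 99%)
    p            -- assumed failure probability; 0.5 is the conservative choice
    """
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    denominator = 1 + error_margin ** 2 * (population - 1) / (t ** 2 * p * (1 - p))
    return math.ceil(population / denominator)


if __name__ == "__main__":
    # Hypothetical fault-space sizes, not figures from the paper.
    for n_faults in (10_000, 1_000_000, 1_000_000_000):
        print(n_faults, statistical_fi_sample_size(n_faults))
```

Under these assumptions the required sample saturates as the fault space grows, but each sample is still a full inference run, and applying the formula per layer or per bit position multiplies the total. That cost is what the abstract refers to when it says statistical FI remains slow for large models.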

Machine Learning, Hardware Architecture, Signal Processing

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the issue of hardware reliability assessment of Deep Neural Networks (DNNs) in safety-critical applications. Specifically, the paper focuses on the following points:

1. **Complexity of Hardware Reliability Assessment**: With the increase in the number of DNN model parameters and computational load, traditional Fault Injection (FI) methods become very time-consuming and resource-intensive. Particularly in modern large-scale DNNs, conducting FI experiments to achieve an acceptable confidence level is almost impractical.
2. **Limitations of Existing Methods**:
   - **Statistical FI Methods**: Although they can reduce the number of simulations, they still require a significant amount of time and computational resources.
   - **Analytical Methods**: While they solve the scalability issue, they cannot provide accurate reliability metrics (such as Vulnerability Factors, VFs).
3. **Proposed New Method**: The paper introduces DeepVigor+, a scalable, fast, and accurate semi-analytical method for DNN reliability assessment. DeepVigor+ significantly reduces the required number of simulations through optimized fault propagation analysis and stratified sampling techniques while maintaining high accuracy.

### Main Contributions

1. **Introduction of DeepVigor+**: A new semi-analytical method capable of quickly and accurately obtaining the Vulnerability Factors (VFs) of each layer and the entire model of DNNs.
2. **Optimized Fault Propagation Analysis**: By assuming that a single fault may occur in the input or weights, the search space is effectively reduced.
3. **Stratified Sampling Technique**: Further accelerates the reliability analysis process without significantly affecting the analysis accuracy.
4. **Efficiency and Accuracy**: Experimental results show that the error of DeepVigor+ is less than 1%, and it reduces the number of simulations by 14.9 to 26.9 times compared to the state-of-the-art statistical FI methods.
5. **Open Source Tool**: DeepVigor+ is released as an open-source tool, enabling researchers to quickly assess the reliability of DNNs and design and develop fault-tolerant DNNs.

### Conclusion

DeepVigor+ provides an efficient, accurate, and scalable method for the reliability assessment of DNNs, particularly suitable for large and deep emerging DNN models. This will help ensure the reliability and safety of DNNs in safety-critical applications.
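The stratified sampling idea listed among the contributions can be illustrated with a generic estimator. This is a minimal sketch of textbook stratified sampling, not the procedure implemented in DeepVigor+: the summary does not say how strata are formed or how a unit is judged vulnerable, so `strata`, `is_vulnerable`, and `samples_per_stratum` are hypothetical names chosen for the example.

```python
import random


def stratified_vf_estimate(strata, is_vulnerable, samples_per_stratum, seed=0):
    """Estimate the vulnerable fraction of a population by stratified sampling.

    strata              -- list of lists; each inner list holds the units of one
                           stratum (hypothetically, the neurons of one channel)
    is_vulnerable       -- callable(unit) -> bool, a stand-in for the per-unit
                           vulnerability analysis
    samples_per_stratum -- how many units to analyze in each stratum
    """
    if samples_per_stratum <= 0:
        raise ValueError("samples_per_stratum must be positive")
    rng = random.Random(seed)
    total_units = sum(len(units) for units in strata)
    if total_units == 0:
        raise ValueError("strata contain no units")

    estimate = 0.0
    for units in strata:
        if not units:
            continue
        k = min(samples_per_stratum, len(units))
        sample = rng.sample(units, k)  # sampling without replacement
        stratum_rate = sum(1 for u in sample if is_vulnerable(u)) / k
        # Weight each stratum by its share of the whole population.
        estimate += (len(units) / total_units) * stratum_rate
    return estimate


if __name__ == "__main__":
    # Toy population: 4 strata of 1000 units, with a made-up vulnerability rule.
    toy_strata = [[(s, i) for i in range(1000)] for s in range(4)]
    toy_rule = lambda unit: unit[1] % (unit[0] + 2) == 0
    print(stratified_vf_estimate(toy_strata, toy_rule, samples_per_stratum=50))
```

The point of the design is that only `samples_per_stratum` units per stratum are analyzed, while the size-based weighting keeps the combined estimate representative of the whole layer. How much accuracy is retained depends on how homogeneous each stratum is.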

DeepVigor: Vulnerability Value Ranges and Factors for DNNs' Reliability Assessment

A Systematic Literature Review on Hardware Reliability Assessment Methods for Deep Neural Networks

Improving Fault Tolerance for Reliable DNN Using Boundary-Aware Activation

DeepDyve: Dynamic Verification for Deep Neural Networks

A Deep Investigation on Stealthy DVFS Fault Injection Attacks at DNN Hardware Accelerators

Special Session: Approximation and Fault Resiliency of DNN Accelerators

APPRAISER: DNN Fault Resilience Analysis Employing Approximation Errors

Thales: Formulating and Estimating Architectural Vulnerability Factors for DNN Accelerators

Enhancing Fault Resilience of QNNs by Selective Neuron Splitting

Reliability Assurance for Deep Neural Network Architectures Against Numerical Defects

SAFFIRA: a Framework for Assessing the Reliability of Systolic-Array-Based DNN Accelerators

Path Analysis for Effective Fault Localization in Deep Neural Networks

Investigating the impact of transient hardware faults on deep learning neural network inference

Estimation of Small Failure Probability Based on Adaptive Subset Simulation and Deep Neural Network

ISimDL: Importance Sampling-Driven Acceleration of Fault Injection Simulations for Evaluating the Robustness of Deep Learning

Characterizing a Neutron-Induced Fault Model for Deep Neural Networks

DeepAxe: A Framework for Exploration of Approximation and Reliability Trade-offs in DNN Accelerators

Exploration of Activation Fault Reliability in Quantized Systolic Array-Based DNN Accelerators

Investigating Fault Injection Techniques in Hardware-Based Deep Neural Networks and Mutation-Based Fault Localization

Exposing Reliability Degradation and Mitigation in Approximate DNNs under Permanent Faults