SAFFIRA: a Framework for Assessing the Reliability of Systolic-Array-Based DNN Accelerators

Mahdi Taheri, Masoud Daneshtalab, Jaan Raik, Maksim Jenihhin, Salvatore Pappalardo, Paul Jimenez, Bastien Deveautour, Alberto Bosio
2024-03-05
Abstract: Systolic array has emerged as a prominent architecture for Deep Neural Network (DNN) hardware accelerators, providing high-throughput and low-latency performance essential for deploying DNNs across diverse applications. However, when used in safety-critical applications, reliability assessment is mandatory to guarantee the correct behavior of DNN accelerators. While fault injection stands out as a well-established practical and robust method for reliability assessment, it is still a very time-consuming process. This paper addresses the time efficiency issue by introducing a novel hierarchical software-based hardware-aware fault injection strategy tailored for systolic array-based DNN accelerators.
Artificial Intelligence, Hardware Architecture, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of efficiently evaluating the reliability of deep neural network (DNN) hardware accelerators based on systolic arrays. Specifically, it focuses on how to perform fault injection (FI) quickly and accurately in safety-critical applications, so that the reliability and robustness of these accelerators under hardware faults can be assessed.

### Background and Challenges

1. **Importance of Reliability Evaluation**: In safety-critical applications, reliability evaluation of DNN hardware accelerators is essential because hardware faults can severely degrade DNN performance.
2. **Limitations of Existing Methods**:
   - **Traditional Fault Injection Methods**: Fault injection is a widely used and effective reliability evaluation method, but traditional approaches are very time-consuming, especially on large-scale DNN accelerators.
   - **Hardware-agnostic Fault Injection Tools**: These tools do not model the underlying hardware, so they cannot accurately reproduce actual hardware behavior.
   - **Hardware-aware Fault Injection Tools**: These tools do model the hardware, but they usually require substantial computational resources and time.

### Main Contributions of the Paper

1. **A New Hierarchical Software Fault Injection Strategy**: The strategy is designed specifically for systolic-array-based DNN accelerators. Modeling the systolic array core with a Uniform Recurrent Equations (URE) system significantly speeds up fault injection.
2. **An Open-source Tool, SAFFIRA**: The tool implements this fault injection strategy, reducing fault injection time to as little as 1/3, or even 1/2000, of the time existing methods require while preserving accuracy.
3. **A New Reliability Metric, Faulty Distance**: This metric better captures the classification performance of DNNs under fault conditions.
4. **Performance Evaluation on the Latest DNN Benchmarks**: The experiments validate the effectiveness and efficiency of the framework.

### Experimental Results

- **Permanent Fault Injection Experiments**: Permanent faults were injected into two quantized versions (8-bit and 16-bit integer) of the LeNet-5 network. The 16-bit network performed better under permanent faults than the 8-bit network.
- **Transient Fault Injection Experiments**: Transient faults were injected into the AlexNet, VGG-16, and ResNet-18 networks to evaluate the fault sensitivity and reliability of each network.

### Conclusion

The paper proposes an efficient hierarchical fault injection strategy that significantly speeds up reliability evaluation for systolic-array-based DNN accelerators through software modeling and optimization. The method far exceeds existing methods in time efficiency while maintaining a high level of accuracy. In addition, the proposed faulty distance metric offers a new perspective for evaluating the fault robustness of DNNs.
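To make the idea of software-based, hardware-aware fault injection more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not SAFFIRA's implementation: the function names (`to_signed`, `stuck_at`, `systolic_matmul`), the output-stationary mapping in which processing element (i, j) owns output element (i, j), the 32-bit accumulator width, and the `fault` tuple layout are all hypothetical choices made for this example. The matrix product is written as the recurrence c(i, j, k) = c(i, j, k-1) + a(i, k) * b(k, j), and a permanent stuck-at fault is applied to one accumulator bit after every update.

```python
def to_signed(value, bits):
    """Wrap an integer into a two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def stuck_at(value, bit, stuck_value, bits):
    """Force one bit of a two's-complement word to 0 or 1."""
    raw = value & ((1 << bits) - 1)
    if stuck_value:
        raw |= 1 << bit
    else:
        raw &= ~(1 << bit)
    return to_signed(raw, bits)


def systolic_matmul(a, b, acc_bits=32, fault=None):
    """Compute a @ b via the recurrence c[i][j] += a[i][k] * b[k][j].

    Each (i, j) pair stands for one processing element (PE) of a
    hypothetical output-stationary array (an assumption of this sketch).
    `fault` is an optional tuple (pe_row, pe_col, bit, stuck_value)
    describing a permanent fault in that PE's accumulator register.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for k in range(m):              # one recurrence step per k
        for i in range(n):
            for j in range(p):
                acc = to_signed(c[i][j] + a[i][k] * b[k][j], acc_bits)
                # A permanent fault corrupts the register on every update.
                if fault is not None and (i, j) == (fault[0], fault[1]):
                    acc = stuck_at(acc, fault[2], fault[3], acc_bits)
                c[i][j] = acc
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
golden = systolic_matmul(a, b)                       # fault-free reference
faulty = systolic_matmul(a, b, fault=(0, 1, 0, 1))   # PE (0, 1), bit 0 stuck at 1
```

Comparing `golden` with `faulty` shows which outputs a given fault site corrupts. Because the fault is applied inside a software model rather than in an RTL simulation, a campaign over many fault sites reduces to repeated function calls, which is the general reason a software-level model can be much faster.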