Abstract: As deep neural networks (DNNs) have grown more sophisticated, increasingly complex accelerators have been designed to run them. This added complexity makes accelerators more vulnerable to transient errors. At the same time, some DNN accelerators are widely used in safety-critical systems, such as autonomous vehicles. This susceptibility makes research on mitigation techniques more important, and accelerator errors should ideally be eliminated. Some researchers have proposed modular redundancy, which is highly reliable but considerably increases overhead. Algorithm-based solutions are a cheaper alternative, but they have been evaluated primarily with software-based error injection. In this study, we propose a novel approach that implements algorithm-based error detection (ABED) and evaluates it under RTL (hardware-level) error injection. Previous studies generally focused on the impact of soft errors in the memory structures of embedded system-based accelerators. The main goal of this research, in contrast, is to study the impact of soft errors in processing elements and how to mitigate them. We implement an algorithm-based error detection scheme that uses checksums to verify convolution operations with low overhead. We first explain the challenges of implementing ABED on FPGA-based accelerators and how to overcome them, and then describe the implementation. We implement and evaluate our solution on an industry-level DNN accelerator, the NVIDIA Deep Learning Accelerator (NVDLA). Our error injection method is constructed to test the most common soft error scenarios in processing units. The results show that algorithm-based fault tolerance can detect all silent data corruptions (SDCs) while maintaining a very low runtime overhead (6-23%).
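The checksum idea mentioned in the abstract can be illustrated with a small software sketch. This is not the authors' RTL implementation or NVDLA's data path; it only shows one common form of the principle, assuming integer arithmetic so the comparison is exact. Because convolution is linear in the filters, summing the outputs over all output channels must equal the convolution of the input with the sum of the filters. The function names `conv2d` and `checksum_ok`, the tensor shapes, and the single bit-flip fault model are all hypothetical choices made for illustration.

```python
import numpy as np

def conv2d(x, w):
    """Direct 'valid' convolution (cross-correlation, stride 1).

    x: input of shape (C, H, W); w: filters of shape (K, C, R, S).
    Returns an int64 array of shape (K, H-R+1, W-S+1).
    """
    C, H, W = x.shape
    K, _, R, S = w.shape
    out = np.zeros((K, H - R + 1, W - S + 1), dtype=np.int64)
    for i in range(H - R + 1):
        for j in range(W - S + 1):
            patch = x[:, i:i + R, j:j + S]
            # Contract the (C, R, S) axes of every filter with the patch.
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def checksum_ok(x, w, out):
    """Return True if `out` is consistent with the filter checksum.

    The checksum filter is the element-wise sum of all K filters; by
    linearity, its output must equal the sum of `out` over channels.
    """
    w_sum = w.sum(axis=0, keepdims=True)   # one extra (checksum) filter
    expected = conv2d(x, w_sum)[0]         # one extra output channel
    observed = out.sum(axis=0)
    return np.array_equal(expected, observed)

# Hypothetical small layer: 3 input channels, 4 filters of size 3x3.
rng = np.random.default_rng(0)
x = rng.integers(-8, 8, size=(3, 6, 6))
w = rng.integers(-8, 8, size=(4, 3, 3, 3))

clean = conv2d(x, w)
faulty = clean.copy()
faulty[2, 1, 1] ^= 1 << 5   # emulate a single bit flip in one output value

print("clean output passes check:", checksum_ok(x, w, clean))
print("faulty output passes check:", checksum_ok(x, w, faulty))
```

In this sketch the check costs one additional filter's worth of computation per layer, which is why checksum schemes tend to be cheaper than duplicating or triplicating the whole computation. A floating-point version would need a tolerance instead of exact equality.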

Towards Reliable AI Applications Via Algorithm-Based Fault Tolerance on NVDLA
