Towards Reliable AI Applications Via Algorithm-Based Fault Tolerance on NVDLA

Mustafa Tarik Sanic, Cong Guo, Jingwen Leng, Minyi Guo, Weiyin Ma
DOI: https://doi.org/10.1109/msn57253.2022.00120
2022-01-01
Abstract: As deep neural networks (DNNs) have grown more sophisticated, the accelerators designed to run them have grown more complex, and that complexity makes them vulnerable to transient errors. Because some DNN accelerators are widely used in safety-critical systems such as autonomous vehicles, research on mitigation techniques is all the more important, and accelerator errors should ideally be eliminated. Some researchers have proposed modular redundancy, which is highly reliable but considerably increases overhead. Algorithm-based solutions are cheaper, but they have been evaluated primarily with software-based error injection. In this study, we propose a novel approach that implements algorithm-based error detection (ABED) for RTL-level (hardware-based) error injection. Previous studies generally focused on the impact of soft errors in the memory structures of embedded-system-based accelerators; the main goal of this research is instead to study the impact of soft errors in processing elements and how to mitigate them. Our ABED implementation uses checksums to verify convolution operations with low overhead. We first explain the challenges of implementing ABED on FPGA-based accelerators and how to overcome them, then describe the implementation. We implement and evaluate our solution on an industry-level DNN accelerator, the NVIDIA Deep Learning Accelerator (NVDLA). Our error injection method is constructed to test the most common soft error scenarios in processing units. The results show that algorithm-based fault tolerance can detect all silent data corruptions (SDCs) while maintaining a very low runtime overhead (6-23%).
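The abstract does not spell out the checksum scheme, so the following is only a minimal sketch of one common way to verify a convolution with checksums, not the authors' NVDLA implementation. It relies on the linearity of convolution: convolving the input with the element-wise sum of all filters should equal the sum of the per-filter outputs. The function names, the single-channel integer data, and the bit flip used to imitate a soft error are all assumptions made for illustration.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation form) on integer lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)]
            for r in range(oh)]


def filter_checksum(kernels):
    """Element-wise sum of all kernels (the hypothetical checksum filter)."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    return [[sum(k[i][j] for k in kernels) for j in range(kw)]
            for i in range(kh)]


def outputs_pass_check(image, kernels, outputs):
    """Compare the checksum-filter output with the sum of all output maps."""
    expected = conv2d(image, filter_checksum(kernels))
    observed = [[sum(o[r][c] for o in outputs)
                 for c in range(len(expected[0]))]
                for r in range(len(expected))]
    return expected == observed


# Illustrative data only: one 4x4 input and three 2x2 filters.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernels = [[[1, 0], [0, -1]], [[2, 1], [1, 0]], [[0, 3], [-2, 1]]]

outputs = [conv2d(image, k) for k in kernels]
clean_ok = outputs_pass_check(image, kernels, outputs)

# Imitate a transient error by flipping one bit of one output value.
outputs[1][0][2] ^= 1 << 4
faulty_ok = outputs_pass_check(image, kernels, outputs)

print(clean_ok, faulty_ok)
```

The check costs one extra convolution plus a summation, regardless of the number of filters, which is why checksum schemes of this kind tend to be cheaper than modular redundancy. Integer arithmetic is used here so that the comparison can be exact; a floating-point version would need a tolerance. A single checksum of this form can also miss multiple errors that happen to cancel in the sum.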