Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Kai Zhao, Sheng Di, Sihuan Li, Xin Liang, Yujia Zhai, Jieyang Chen, Kaiming Ouyang, Franck Cappello, Zizhong Chen

DOI: https://doi.org/10.1109/tpds.2020.3043449

IF: 5.3

2021-01-01

IEEE Transactions on Parallel and Distributed Systems

Abstract: Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this article, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).

Computer Science, Theory & Methods; Engineering, Electrical & Electronic

What problem does this paper attempt to address?

This paper addresses the reliability problems that soft errors cause during convolutional neural network (CNN) inference. Specifically:

1. **Background and Motivation**:
   - CNNs are becoming increasingly important in many fields, including image classification, object detection, natural language processing, and medical image analysis.
   - CNN inference applications are deployed in safety-critical systems, where they may suffer soft errors caused by high-energy particles, high temperature, or abnormal voltage.
   - Ensuring the stability of the CNN inference process against soft errors is therefore crucial.
2. **Limitations of Existing Methods**:
   - Error-correcting code (ECC) cannot protect computational components.
   - Instruction duplication techniques incur high overhead and require application-specific and hardware-specific optimization, which makes them difficult to implement on all CNN accelerators.
   - Existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations.
3. **Research Contributions**:
   - Propose several checksum-based ABFT schemes and analyze their fault protection ability and runtime. Unlike traditional ABFT based on matrix-matrix multiplication, these schemes support any convolution implementation.
   - Design a workflow that integrates all the proposed schemes to achieve high detection/correction ability with limited total runtime overhead.
   - Evaluate on ImageNet with well-known CNN models (AlexNet, VGG-19, ResNet-18, and YOLOv2). Experimental results show that the implementation handles soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).

Through these methods, the paper aims to make CNN inference reliable enough for stable operation in safety-critical deployments.
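
The checksum idea behind the first contribution can be illustrated with a small sketch. It relies only on the linearity of convolution: the sum of the output maps produced by several kernels should equal the convolution of the input with the sum of those kernels. This is an illustrative assumption about how such a check might look, not the paper's actual schemes or workflow. The helper names `conv2d`, `add_maps`, and `checksum_ok`, the single-channel integer toy data, and the injected error value are all hypothetical.

```python
import random


def conv2d(image, kernel):
    """Valid 2-D cross-correlation of one channel with one kernel (stride 1)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]


def add_maps(maps):
    """Element-wise sum of equally sized 2-D maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) for j in range(cols)]
            for i in range(rows)]


def checksum_ok(image, kernels, outputs, tol=1e-6):
    """Check sum(outputs) against conv2d(image, sum(kernels)).

    The tolerance is relative; it only matters for floating-point data.
    """
    expected = conv2d(image, add_maps(kernels))
    observed = add_maps(outputs)
    return all(abs(e - o) <= tol * max(1.0, abs(e))
               for exp_row, obs_row in zip(expected, observed)
               for e, o in zip(exp_row, obs_row))


# Toy data (assumption): one 6x6 input channel and four 3x3 kernels.
rng = random.Random(0)
image = [[rng.randint(-5, 5) for _ in range(6)] for _ in range(6)]
kernels = [[[rng.randint(-3, 3) for _ in range(3)] for _ in range(3)]
           for _ in range(4)]

# The protected computation: one output map per kernel.
outputs = [conv2d(image, k) for k in kernels]
clean = checksum_ok(image, kernels, outputs)

# Simulate a soft error by corrupting a single output element.
outputs[2][1][1] += 64
corrupted = checksum_ok(image, kernels, outputs)

print(clean, corrupted)
```

The check costs one extra convolution regardless of the number of kernels, and it never inspects how `outputs` was computed, so it does not depend on the convolution implementation being protected. This sketch only detects an error; locating and correcting it would need additional checksums.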

Improving Fault Tolerance for Reliable DNN Using Boundary-Aware Activation

FTT-NAS: Discovering Fault-Tolerant Convolutional Neural Architecture

Cost-Effective Fault Tolerance for CNNs Using Parameter Vulnerability Based Hardening and Pruning

An Autonomous Error-Tolerant Architecture Featuring Self-reparation for Convolutional Neural Networks

DeepCNN: A Dual Approach to Fault Localization and Repair in Convolutional Neural Networks

Implementation of Highly Reliable Convolutional Neural Network with Low Overhead on Field-Programmable Gate Array

Soft Error Mitigation for Deep Convolution Neural Network on FPGA Accelerators

A Survey on Impact of Transient Faults on BNN Inference Accelerators

FPGA Implementation of a Fault-Tolerant Fused and Branched CNN Accelerator With Reconfigurable Capabilities

Efficient Error-Tolerant Quantized Neural Network Accelerators

Soft Error Tolerant Convolutional Neural Networks on FPGAs with Ensemble Learning

Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs

Soft Error Reliability Analysis of Vision Transformers

Reliable Classification with Ensemble Convolutional Neural Networks

Detect and Replace: Efficient Soft Error Protection of FPGA-Based CNN Accelerators

ApproxABFT: Approximate Algorithm-Based Fault Tolerance for Vision Transformers

Towards Enhancing Fault Tolerance in Neural Networks

Defective Convolutional Networks

Evaluation and Mitigation of Weight-Related Single Event Upsets in a Convolutional Neural Network

Exploring Winograd Convolution for Cost-effective Neural Network Fault Tolerance