Abstract: Support Vector Machines (SVMs) are widely used in Machine Learning (ML) to perform classification. For a given element, an SVM computes a value using a kernel function and several support vectors to determine its class. Unfortunately, the computational units can be affected by externally induced phenomena such as soft errors, so an erroneous result can be obtained. This is an issue when an SVM is used in safety-critical applications in which a change of the classification result is not acceptable. To ensure that errors do not change the classification result, traditional protection schemes can be used. For example, if the SVM computation is done on a processor, calculations can be executed twice and the results compared. If they differ, an error is detected and the calculation is done a third time; voting is then used to obtain the most likely result. However, this approach incurs a large cost in computing resources and may not be acceptable when an SVM is used in resource-constrained platforms such as Internet of Things (IoT) devices. In this paper, Result-based Re-computation (RBR) is proposed; RBR is an efficient technique to protect SVMs from errors in the kernel function, which is the most complex part of the SVM implementation. RBR is based on the observation that the SVM result is a sum of kernel terms: it detects the terms in which an error can modify the classification result, and only these terms must be re-computed. The evaluation results using several publicly available datasets show that, compared to a traditional protection scheme, the proposed RBR reduces the re-computation needed to protect an SVM against errors by up to 95.58%.

Impact Statement: Error tolerance is critical in safety-critical Artificial Intelligence (AI) applications because errors can have dramatic consequences and safety implications. However, conventional protection solutions always incur a large computational cost that could become prohibitive in resource-constrained platforms and application domains such as Internet of Things devices. The Algorithm-based Error Tolerance (ABET) approach for Support Vector Machines (SVMs) proposed in this paper has a significant advantage in terms of low computational demands. This advantage makes it very attractive in practice. The proposed algorithm reduces computational time by up to 95.58% when compared to a classic protection scheme. This saving will promote the use of ABET approaches in industry, especially in embedded low-power systems.
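The idea summarized in the abstract, that the SVM result is a sum of kernel terms and only the terms able to change the class need re-computation, can be illustrated with a minimal Python sketch. It is not the paper's exact criterion. It assumes a two-class SVM with an RBF kernel (so every correct kernel value lies in [0, 1]), at most one erroneous kernel value per classification, and an error-free final summation. All function names (`rbf_kernel`, `terms_to_recompute`, `recompute_with_vote`, `classify_rbr`) and the interval test are hypothetical choices made for illustration.

```python
import math


def rbf_kernel(x, sv, gamma):
    """RBF kernel exp(-gamma * ||x - sv||^2); its value always lies in [0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, sv))
    return math.exp(-gamma * sq_dist)


def terms_to_recompute(coefs, kernel_values, bias):
    """Indices of kernel terms in which an error could flip the sign of the sum.

    Assumptions (not taken from the paper): at most one kernel value is wrong,
    and the correct value of every kernel term lies in [0, 1].
    """
    total = bias + sum(c * k for c, k in zip(coefs, kernel_values))
    critical = []
    for i, (c, k) in enumerate(zip(coefs, kernel_values)):
        if not 0.0 <= k <= 1.0:
            critical.append(i)    # out-of-range value: certainly erroneous
            continue
        rest = total - c * k      # bias plus all other terms
        lo = rest + min(0.0, c)   # smallest sum the true term could give
        hi = rest + max(0.0, c)   # largest sum the true term could give
        if lo < 0.0 <= hi:        # both classes reachable: result not safe
            critical.append(i)
    return critical


def recompute_with_vote(kernel, x, sv, gamma, first):
    """Re-compute one kernel term; a third run breaks the tie on a mismatch."""
    second = kernel(x, sv, gamma)
    if second == first:
        return first
    third = kernel(x, sv, gamma)
    if third in (first, second):
        return third
    raise RuntimeError("no two kernel evaluations agree")


def classify_rbr(x, support_vectors, coefs, bias, gamma, kernel=rbf_kernel):
    """Classify x, re-computing only the kernel terms that could matter.

    coefs[i] stands for the signed weight (alpha_i * y_i) of support vector i.
    Returns the class label (+1 or -1) and the indices that were re-computed.
    """
    values = [kernel(x, sv, gamma) for sv in support_vectors]
    critical = terms_to_recompute(coefs, values, bias)
    for i in critical:
        values[i] = recompute_with_vote(
            kernel, x, support_vectors[i], gamma, values[i]
        )
    total = bias + sum(c * k for c, k in zip(coefs, values))
    return (1 if total >= 0.0 else -1), critical
```

In this sketch a term is re-computed only when replacing its kernel value by any value in [0, 1] could move the sum across zero. Terms with a small weight relative to the margin of the sum are skipped, which is where the saving over duplicating every kernel evaluation would come from. The range check on each kernel value is an extra safeguard added here so that an obviously corrupted value is always re-computed.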

Fault Tolerance in Iterative-Convergent Machine Learning

Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining

On Misbehaviour and Fault Tolerance in Machine Learning Systems

An Error-Resilient Redundant Subspace Correction Method

Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training

Quantitative assessment of machine learning reliability and resilience

Convergence-aware optimal checkpointing for exploratory deep learning training jobs

Variance of ML-based software fault predictors: are we really improving fault prediction?

Light-Weight Fault Tolerant Attention for Large Language Model Training

A Study of Checkpointing in Large Scale Training of Deep Neural Networks

Building Single Fault Survivable Parallel Algorithms for Matrix Operations Using Redundant Parallel Computation

Testing and Validating Machine Learning Classifiers by Metamorphic Testing.

Improving Performance of Iterative Methods by Lossy Checkponting

Quantifying the Impact of Memory Errors in Deep Learning

A Fault-Tolerant Distributed Framework for Asynchronous Iterative Computations

TwinCG: Dual Thread Redundancy with Forward Recovery for Conjugate Gradient Methods

Iterative Learning Fault-Tolerant Control for Discrete-Time Nonlinear Systems Subject to Stochastic Actuator Faults

Towards Fault Tolerance in Multi-Agent Reinforcement Learning

Algorithmic Based Fault Tolerance Applied to High Performance Computing

Result-Based Re-computation for Error-Tolerant Classification by a Support Vector Machine.

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism