Abstract: The predicted reduced resiliency of next-generation high performance computers means that it will become necessary to take into account the effects of randomly occurring faults on numerical methods. Further, in the event of a hard fault occurring, a decision has to be made as to what remedial action should be taken in order to resume the execution of the algorithm. The action that is chosen can have a dramatic effect on the performance and characteristics of the scheme. Ideally, the resulting algorithm should be subjected to the same kind of mathematical analysis that was applied to the original, deterministic variant. The purpose of this work is to provide an analysis of the behaviour of the multigrid algorithm in the presence of faults. Multigrid is arguably the method of choice for the solution of large-scale linear algebra problems arising from discretization of partial differential equations and it is of considerable importance to anticipate its behaviour on an exascale machine. The analysis of resilience of algorithms is in its infancy and the current work is perhaps the first to provide a mathematical model for faults and analyse the behaviour of a state-of-the-art algorithm under the model. It is shown that the Two Grid Method fails to be resilient to faults. Attention is then turned to identifying the minimal necessary remedial action required to restore the rate of convergence to that enjoyed by the ideal fault-free method.

What problem does this paper attempt to address?

The paper addresses the following problem: the predicted reduced resiliency of next-generation high-performance computers means that the impact of randomly occurring faults must be taken into account when designing numerical methods. Specifically, it asks how the multigrid method, applied to large-scale linear algebra problems, behaves when faults occur, and whether its convergence is preserved. In addition, when a hard fault occurs, a remedial action must be chosen in order to resume execution of the algorithm, and that choice can have a dramatic effect on the performance and characteristics of the algorithm.

### Core problems of the paper

1. **Behavior analysis of the multigrid algorithm in a faulty environment**:
   - Study the behavior of the multigrid algorithm in the presence of random faults.
   - Provide a mathematical model of faults and analyze the behavior of the multigrid algorithm under this model.
2. **Resilience of the two-grid method**:
   - Show that the Two Grid Method is not resilient to faults.
   - Identify the minimal remedial action needed to restore the convergence rate of the ideal fault-free method.

### Background and motivation

As supercomputers develop towards the exascale level, next-generation high-performance computers are expected to experience more hardware faults. These include random bit flips and logic state corruptions due to factors such as low-voltage logic thresholds, reduced cell capacitance, and minimized data movement. Numerical methods designed for fault-free machines therefore need to be re-examined, and possibly adjusted, for this new computing environment.

### Methodology

The paper introduces a simple probabilistic model of faults and applies it to iterative algorithms. Specifically:

- Faults are modeled by a Bernoulli random variable \(\chi\) that takes the value 0 (a fault occurs) with probability \(q\) and the value 1 (no fault occurs) with probability \(1 - q\).
- A value \(x\) that is subject to faults is replaced with \(\tilde{x}=\chi x\); that is, when a fault is detected the original value is replaced with 0, and otherwise it is left unchanged.
- The impact of faults is considered separately for each step of the multigrid algorithm (smoothing, restriction, prolongation, and coarse-grid correction), and the corresponding random iteration matrices are constructed.

### Main results

- **Lack of resilience of the two-grid method**: Theorem 3 shows that the two-grid method is not resilient to faults.
- **Minimal necessary remedial action**: Theorem 4 shows that protecting the prolongation operation restores the convergence rate of the ideal fault-free method.
- **Numerical evidence**: Numerical experiments are provided to support these theoretical results.

### Summary

This work is perhaps the first to provide a mathematical model of faults and to analyze the behavior of the multigrid algorithm under that model. The results show that the two-grid method is not resilient in a faulty environment, but that its convergence rate can be restored by a specific remedial measure, namely protecting the prolongation. This provides a theoretical basis for designing more resilient multigrid algorithms.
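The fault model from the Methodology section can be illustrated with a small experiment. The following is a minimal sketch, not the paper's implementation: it assumes a 1D Poisson model problem, a damped Jacobi smoother, full-weighting restriction, and linear-interpolation prolongation, none of which are specified in the summary above. It also assumes that faults act componentwise and that they hit the computed updates (so a fault means a lost update rather than a zeroed iterate). All function names, the fault probability 0.05, and the problem size are hypothetical choices made for illustration.

```python
import math
import random


def apply_fault(values, q, rng):
    """Componentwise x_tilde = chi * x, with chi = 0 with probability q, else 1."""
    return [0.0 if rng.random() < q else v for v in values]


def residual(x, b, h):
    """r = b - A x for A = tridiag(-1, 2, -1) / h^2 (1D Poisson, zero boundary values)."""
    n = len(x)
    r = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        r.append(b[i] - (2.0 * x[i] - left - right) / (h * h))
    return r


def solve_coarse(f, H):
    """Thomas algorithm for the coarse system tridiag(-1, 2, -1) / H^2 * u = f."""
    m = len(f)
    cp, dp = [0.0] * m, [0.0] * m
    cp[0], dp[0] = -0.5, H * H * f[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + cp[i - 1]
        cp[i] = -1.0 / denom
        dp[i] = (H * H * f[i] + dp[i - 1]) / denom
    u = [0.0] * m
    u[-1] = dp[-1]
    for i in range(m - 2, -1, -1):
        u[i] = dp[i] - cp[i] * u[i + 1]
    return u


def two_grid(n_coarse, q, rng, protect_prolongation, cycles=10, omega=2.0 / 3.0):
    """Run two-grid cycles and return the residual norm before and after each cycle."""
    n = 2 * n_coarse + 1
    h = 1.0 / (n + 1)
    b, x = [1.0] * n, [0.0] * n

    def norm(v):
        return math.sqrt(sum(t * t for t in v))

    def smooth(x):
        # Damped Jacobi update; entries hit by a fault are simply not updated.
        dx = [omega * h * h * ri / 2.0 for ri in residual(x, b, h)]
        return [xi + di for xi, di in zip(x, apply_fault(dx, q, rng))]

    norms = [norm(residual(x, b, h))]
    for _ in range(cycles):
        x = smooth(x)
        r = residual(x, b, h)
        # Full-weighting restriction, then faults on the restricted residual.
        r_c = [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
               for j in range(n_coarse)]
        r_c = apply_fault(r_c, q, rng)
        # For this model problem the Galerkin coarse operator is the same stencil on mesh 2h.
        e_c = solve_coarse(r_c, 2.0 * h)
        # Linear-interpolation prolongation of the coarse correction.
        corr = [0.0] * n
        for j, ej in enumerate(e_c):
            corr[2 * j] += 0.5 * ej
            corr[2 * j + 1] += ej
            corr[2 * j + 2] += 0.5 * ej
        if not protect_prolongation:
            corr = apply_fault(corr, q, rng)
        x = [xi + ci for xi, ci in zip(x, corr)]
        x = smooth(x)
        norms.append(norm(residual(x, b, h)))
    return norms


for label, q, protect in [("fault-free", 0.0, True),
                          ("faults in every step", 0.05, False),
                          ("faults, prolongation protected", 0.05, True)]:
    norms = two_grid(31, q, random.Random(0), protect)
    print(f"{label}: residual reduction {norms[-1] / norms[0]:.2e}")
```

Repeating the runs with different seeds, fault probabilities, and mesh sizes gives a rough feel for how unprotected and protected prolongation compare. A sketch this small is an illustration of the model only and should not be read as evidence for or against the theorems.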

Is the Multigrid Method Fault Tolerant? The Two-Grid Case
