Resilience Against Soft Faults through Adaptivity in Spectral Deferred Correction

Thomas Baumann, Sebastian Götschel, Thibaut Lunet, Daniel Ruprecht, Robert Speck
2024-12-01
Abstract: As supercomputers grow in hardware complexity, their susceptibility to faults increases and measures need to be taken to ensure the correctness of results. Some numerical algorithms have certain characteristics that allow them to recover from some types of faults. It has been demonstrated that adaptive Runge-Kutta methods provide resilience against transient faults without adding computational cost. Using recent advances in adaptive step size selection for spectral deferred correction (SDC), an iterative numerical time stepping scheme that can produce methods of arbitrary order, we show that adaptive SDC can also detect and correct transient faults. Its performance is found to be comparable to that of the dedicated resilience strategy Hot Rod.
Distributed, Parallel, and Cluster Computing; Numerical Analysis
What problem does this paper attempt to address?
As the hardware complexity of supercomputers grows, so does their susceptibility to faults. To keep computed results correct, numerical algorithms need to be resilient when soft faults occur. Specifically, this paper investigates how adaptivity in spectral deferred correction (SDC) provides resilience against transient faults.

### Problem Background

1. **Impact of Hardware Faults**
   - As supercomputer systems grow in scale and complexity, the probability of hardware faults rises.
   - A fault in a single hardware component can cause the entire system to fail. For example, a fault in the Mariner 8 mission led to a launch failure.
2. **Fault Types**
   - **Hard Faults**: Hardware damage that persists until repaired.
   - **Soft Faults**: One-time events that do not recur when the operation is repeated.
   - **Silent Data Corruption**: Faults that are not detected immediately because the solution still looks reasonable, although it is wrong.
3. **Existing Solutions**
   - **At the Hardware Level**: Error-correcting codes (ECC), which increase memory usage and energy consumption.
   - **At the Software Level**: Strategies such as replication and checkpointing, which usually come with significant overhead.

### Core Problem of the Paper

This paper proposes a method based on adaptive spectral deferred correction (SDC) that improves resilience to soft faults in two ways:

- **Adaptive Time-Step Selection**: Detect and correct transient faults by estimating the local error and adjusting the time step.
- **Adaptive Iteration Number Selection**: Judge convergence based on residuals or increments and restart if necessary to recover from faults.

### Main Contributions

1. **Detecting and Correcting Transient Faults**: Adaptive SDC is shown to detect and correct transient faults.
2. **Performance Comparison**: The performance of adaptive SDC is found to be comparable to that of the dedicated resilience strategy Hot Rod.
3. **Experimental Verification**: The effectiveness of adaptive SDC is verified by inserting faults into multiple benchmark problems and computing the recovery rate.

### Conclusion

Adaptive spectral deferred correction provides an efficient, low-overhead way to make numerical algorithms more resilient to soft faults, which is relevant for future large-scale parallel computing.

### Summary of Mathematical Formulas

1. **Initial Value Problem Solved by SDC**

   \[ u_t = f(u), \quad u(t = 0) = u_0 \]

2. **Approximate Solution after Discretization**

   \[ u(\tau_m) \approx u_m = u_0 + \sum_{j = 1}^{M} q_{mj} f(u_j) \]

   Here $\tau_m$ are the $M$ collocation nodes and $q_{mj}$ are the quadrature weights.

3. **Local Error Estimation in Adaptive Step-Size Selection**

   \[ \epsilon = \|u^{(p)} - u^{(q)}\|_\infty = \|\delta^{(p)} - \delta^{(q)}\|_\infty = \|\delta^{(p)}\|_\infty + O(\Delta t^{q + 1}) \]

   Here $u^{(p)}$ and $u^{(q)}$ are solutions of order $p$ and $q > p$, and $\delta^{(p)}$ and $\delta^{(q)}$ are their local errors.

4. **Optimal Step-Size Calculation**

   \[ \Delta t_{\text{opt}} = \beta \Delta t \left( \frac{\epsilon_{\text{TOL}}}{\epsilon} \right)^{\frac{1}{p + 1}} \]

   where $\epsilon_{\text{TOL}}$ is the tolerance for the local error and $\beta$ is a safety factor, usually set to 0.9.

With these ingredients, the paper demonstrates that adaptive SDC improves fault tolerance.
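The interplay of the error estimate, the step-size formula, and the restart on rejection can be illustrated with a small sketch. It is not the paper's SDC implementation: to stay self-contained it swaps in the embedded Heun-Euler Runge-Kutta pair ($p = 1$, $q = 2$) as the source of the two solutions $u^{(p)}$ and $u^{(q)}$. The function names (`optimal_step`, `heun_euler_step`, `integrate`), the test problem $u_t = -u$, and the fault model (a single additive perturbation of one stage, selected by the hypothetical parameters `fault_step` and `fault`) are all assumptions made for illustration.

```python
import math


def optimal_step(dt, eps, eps_tol, p, beta=0.9):
    """Step-size update: dt_opt = beta * dt * (eps_tol / eps)^(1 / (p + 1))."""
    return beta * dt * (eps_tol / eps) ** (1.0 / (p + 1))


def heun_euler_step(f, u, dt, fault=0.0):
    """One embedded Heun-Euler step with an optional injected fault.

    `fault` is a hypothetical additive perturbation of the second stage,
    standing in for a bit flip. Returns the order-2 solution and the
    estimate eps = |u^(p) - u^(q)| with p = 1, q = 2.
    """
    k1 = f(u)
    k2 = f(u + dt * k1) + fault
    u_low = u + dt * k1                 # forward Euler, order p = 1
    u_high = u + 0.5 * dt * (k1 + k2)   # Heun, order q = 2
    return u_high, abs(u_low - u_high)


def integrate(f, u0, t_end, dt, eps_tol, fault_step=None, fault=0.0):
    """Adaptive integration; returns the final solution and the restart count.

    The fault is transient: it is injected only on the attempt with index
    `fault_step`, so a repeated step is computed without it.
    """
    t, u, restarts, attempts = 0.0, u0, 0, 0
    while t < t_end - 1e-14:
        dt = min(dt, t_end - t)
        perturbation = fault if attempts == fault_step else 0.0
        attempts += 1
        u_new, eps = heun_euler_step(f, u, dt, perturbation)
        if eps <= eps_tol:
            t, u = t + dt, u_new        # accept the step
        else:
            restarts += 1               # reject: redo from the same t and u
        # The same formula sets the step size after accepted and rejected steps.
        dt = optimal_step(dt, max(eps, 1e-16), eps_tol, p=1)
    return u, restarts


if __name__ == "__main__":
    f = lambda u: -u
    clean = integrate(f, 1.0, 1.0, 0.01, 1e-4)
    faulty = integrate(f, 1.0, 1.0, 0.01, 1e-4, fault_step=10, fault=1e3)
    print("exact:", math.exp(-1.0))
    print("clean (u, restarts):", clean)
    print("faulty (u, restarts):", faulty)
```

In this sketch the corrupted stage inflates the error estimate far above the tolerance, so the step is rejected and recomputed. No separate fault detector is involved: the estimate that drives step-size selection doubles as the detection mechanism, which is why this kind of resilience adds no extra cost in the fault-free case. Faults too small to push the estimate above the tolerance would pass undetected in this simplified model.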