Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance

Erlin Yao, Jiutian Zhang, Mingyu Chen, Guangming Tan, Ninghui Sun
DOI: https://doi.org/10.1177/1094342015578487
2015-01-01
Abstract: Soft errors in scientific computing applications are becoming inevitable with ever-increasing system scale and execution time, and with new technologies that feature increased transistor density and lower voltage. Soft errors fall mainly into two categories: bit-flipping errors (e.g., 1 becomes −1 in random access memory) and computation errors (e.g., 1 + 1 = 3 in floating-point units). Traditionally, bit-flipping errors are handled by Error Correcting Code (ECC), and computation errors by Triple Modular Redundancy (TMR). However, ECC cannot handle computation errors, while TMR cannot deal with bit-flipping errors and is inefficient at handling computation errors. The Algorithm-Based Fault Tolerance (ABFT) method was developed to handle both computation and bit-flipping errors in matrix operations uniformly and efficiently. This paper focuses on detecting soft errors in the LU Decomposition with Partial Pivoting (LUPP) algorithm, which is widely used in scientific computing applications. First, it notes that existing ABFT methods are inadequate for detecting soft errors in LUPP in terms of time or space. It then proposes a new ABFT algorithm that detects soft errors in LUPP in a way that is both flexible in time and comprehensive in space. Flexible in time means that soft errors can be detected at flexible points during execution rather than only at the end of LUPP; comprehensive in space means that all elements of the data matrices L and U are covered by the detection. To show the feasibility and efficiency of the proposed algorithm, the authors incorporated it into the implementation of LUPP in the widely used High Performance Linpack (HPL) benchmark. Experimental results verify the feasibility of the algorithm: for soft errors injected at various times and into different elements in LUPP, it detected most of the injected errors, including every error that fails the residual check of HPL. Both theoretical overhead analysis and experiments demonstrate that this ABFT algorithm is also very efficient at detecting soft errors in LUPP.
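To make the checksum idea behind ABFT concrete, the following is a minimal sketch of the classic row-checksum technique applied to LU with partial pivoting. It is not the algorithm proposed in the paper, whose details the abstract does not give. The function names `lupp_with_checksum` and `_bad_rows`, the `inject` parameter, and the tolerance are all illustrative assumptions. The sketch appends a column of row sums to the matrix, applies every row swap and row update to that column as well, and after each elimination step checks that each row of the U part and of the trailing submatrix still sums to its carried checksum.

```python
def _bad_rows(m, k, n, tol):
    """Return indices of rows whose U/trailing entries disagree with the checksum."""
    bad = []
    for i in range(n):
        # Rows 0..k are finished rows of U (entries from column i onward);
        # later rows belong to the trailing submatrix (columns k+1 onward).
        start = i if i <= k else k + 1
        entries = m[i][start:n]
        scale = max(1.0, sum(abs(x) for x in entries) + abs(m[i][n]))
        if abs(sum(entries) - m[i][n]) > tol * scale:
            bad.append(i)
    return bad


def lupp_with_checksum(a, inject=None, tol=1e-9):
    """In-place style LU with partial pivoting on a copy of `a`, with a checksum column.

    inject: optional (step, row, col, delta), a hypothetical fault that adds
            `delta` to entry (row, col) right after elimination step `step`.
    Returns (lu_augmented, perm, hit); hit is None if every check passed,
    otherwise (step, list_of_bad_rows) for the first failing check.
    """
    n = len(a)
    m = [list(row) + [sum(row)] for row in a]  # append row-sum checksum column
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the largest magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]  # the swap carries the checksum along
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k + 1, n + 1):  # update includes the checksum column
                m[i][j] -= f * m[k][j]
            m[i][k] = f  # store the multiplier (entry of L) in place
        if inject is not None and inject[0] == k:
            _, r, c, delta = inject
            m[r][c] += delta  # simulated soft error
        bad = _bad_rows(m, k, n, tol)
        if bad:
            return m, perm, (k, bad)
    return m, perm, None


a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
print(lupp_with_checksum(a)[2])                            # clean run
print(lupp_with_checksum(a, inject=(0, 2, 2, 1.0))[2])     # fault in trailing matrix
print(lupp_with_checksum(a, inject=(0, 1, 0, 1.0))[2])     # fault in a stored L entry
```

Because the check runs after every step, an error in the U part or the trailing submatrix is flagged during the factorization rather than at the end. The third call corrupts a stored multiplier of L, which this simple row checksum does not cover; that gap illustrates why coverage of both L and U is a separate requirement.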