OS kernel supported fault tolerant MPI

Yingming Chen, Zhihui Du, Lin Peng, Sanli Li
2001-01-01
Abstract: Parallel cluster systems now contain large numbers of computing nodes, so the risk that an individual node fails is high. TH-MPI is a fault-tolerant MPI supported by the Linux kernel. It optimizes checkpointing by using Linux kernel module technology and diskless checkpointing. The results show that checkpointing in TH-MPI performs well.