Checkpointing of parallel applications through differential memory functions

Xue Ruini, Chen Wenguang, Zheng Weimin
DOI: https://doi.org/10.3321/j.issn:1671-4512.2005.z1.031
2005-01-01
Journal of Huazhong University of Science and Technology
Abstract: As high-performance computing systems continue to grow in size and popularity, fault tolerance and reliability become limiting factors for application scalability and system availability. Current checkpoint/restart-based fault tolerance systems for parallel applications cannot handle the communication environment transparently. Sockets must be closed before checkpointing and re-established after recovery, which is difficult to implement and error-prone. A "communication exclusion" approach, based on a differential memory function, is proposed to separate the communication and computation modules so that sockets need not be handled directly. Experimental results indicate a slight improvement in checkpointing performance. The strategy helps reduce implementation complexity and improve recovery reliability, and it is easy to port because it is independent of any particular parallel system.