Thckpt: Transparent Checkpointing of Linux Processes under IA-64.

RN Xue, YH Zhang, WG Chen, WM Zheng
2005-01-01
Abstract: Checkpointing and Rollback Recovery (CRR) is a simple and effective technique for fault tolerance: the state of an executing program is periodically saved to a disk file, from which it can be recovered after a failure. CRR has been widely studied on most IA-32 platforms for both sequential and parallel applications, but little work has been done to port it to IA-64. The new architecture and techniques introduced in IA-64 improve performance and flexibility but also increase complexity, which makes checkpointing harder. This paper analyzes the special features of IA-64, presents strategies for overcoming the difficulties they pose for checkpointing and recovery, and describes the implementation of a transparent sequential checkpointing library for Linux on IA-64. Performance evaluation shows that the library adapts well to different long-running programs and induces negligible overhead.