A Checkpoint-Based High Availability Run-Time System for Windows NT Clusters.

Youhui Zhang, Dongsheng Wang
DOI: https://doi.org/10.1145/509526.509530
2002-01-01
Abstract: This paper presents ChaRM-NT, a high-availability run-time system that provides checkpoint-based rollback recovery for parallel applications on clusters of computers (COCs) running Windows NT. ChaRM-NT implements an insert-mode, reduced coordinated checkpointing and rollback recovery (CRR) mechanism. With these techniques, ChaRM-NT can recover parallel applications from checkpoint files after a system failure. In addition, we have implemented a new coordinated checkpointing algorithm that requires only O(n) control messages, where n is the number of participating processes. ChaRM-NT also implements a portable single-process CRR library that is independent of the message passing environment (MPE). It is therefore easy to adapt to different MPEs, and it currently supports PVM and MPI for NT.
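To make the O(n) claim concrete, the following is a minimal sketch of a generic coordinator-driven checkpoint round, not the ChaRM-NT algorithm itself, whose details the abstract does not give. It assumes a single coordinator that exchanges REQUEST, ACK, and COMMIT messages with each of the other n - 1 processes, so the total is 3(n - 1) control messages. The names `Process` and `coordinated_checkpoint` are hypothetical, and the sketch ignores in-flight application messages, which a real CRR system must also handle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    """Hypothetical stand-in for one process of a parallel application."""
    pid: int
    state: int = 0                    # toy application state
    tentative: Optional[int] = None   # checkpoint taken but not yet committed
    committed: Optional[int] = None   # last globally consistent checkpoint

    def take_tentative(self) -> None:
        self.tentative = self.state

    def commit(self) -> None:
        self.committed = self.tentative
        self.tentative = None


def coordinated_checkpoint(processes: List[Process]) -> int:
    """Run one checkpoint round and return the number of control messages.

    Assumption: processes[0] acts as coordinator; message delivery is
    simulated by direct method calls and is always reliable.
    """
    coordinator, others = processes[0], processes[1:]
    messages = 0

    coordinator.take_tentative()
    for p in others:          # phase 1: coordinator -> p, REQUEST
        messages += 1
        p.take_tentative()
    for _ in others:          # phase 2: p -> coordinator, ACK
        messages += 1
    coordinator.commit()
    for p in others:          # phase 3: coordinator -> p, COMMIT
        messages += 1
        p.commit()
    return messages


procs = [Process(pid=i, state=i * 10) for i in range(5)]
count = coordinated_checkpoint(procs)
print(count)
```

Under these assumptions the message count grows linearly with the number of processes, in contrast to schemes in which every process notifies every other process and the count grows as O(n^2).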