libcrpm: Improving the Checkpoint Performance of NVM

Feng Ren, Kang Chen, Yongwei Wu
DOI: https://doi.org/10.1145/3489517.3530536
2022-01-01
Abstract: libcrpm is a new programming library that improves checkpoint performance for applications running on NVM. It proposes a failure-atomic differential checkpointing protocol. This protocol simultaneously addresses two problems in current NVM-based checkpoint-recovery libraries: (1) high write amplification when page-granularity incremental checkpointing is used, and (2) high persistence costs from excessive memory fence instructions when line-grained undo logging or copy-on-write is used. Evaluation results show that libcrpm reduces checkpoint overhead in realistic workloads. For MPI-based parallel applications such as LULESH, the checkpoint overhead of libcrpm is only 44.78% of that of FTI, an application-level checkpoint-recovery library.
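A toy cost model can make the two costs contrasted in the abstract concrete. The sketch below is a hypothetical illustration, not libcrpm's protocol or API. The 4 KiB page and 64 B cache-line sizes are common x86 values. Everything else is an assumption chosen for illustration: the 256 B block size, the function names, the dirty-page set, and the fence counts. Undo logging is charged one fence per logged line. The differential scheme is charged one fence after flushing copied blocks and one after a commit flag.

```python
"""Toy cost model for the two problems named in the abstract.

Hypothetical illustration only: the block size, fence counts, and the
dirty-page-plus-diff scheme are assumptions, not libcrpm's implementation.
"""
PAGE = 4096   # page size for page-granularity incremental checkpointing
LINE = 64     # cache-line size for line-grained undo logging
BLOCK = 256   # hypothetical sub-page block size for differential checkpointing


def page_incremental(work: bytearray, ckpt: bytearray, dirty_pages: set[int]) -> tuple[int, int]:
    """Copy every dirty page in full; assume one fence commits the checkpoint."""
    written = 0
    for p in sorted(dirty_pages):
        lo = p * PAGE
        ckpt[lo:lo + PAGE] = work[lo:lo + PAGE]
        written += PAGE
    return written, 1


def line_undo_log(first_writes: int) -> tuple[int, int]:
    """Log the old 64 B of each line before its first update; each log entry
    must be persisted (one fence) before the in-place store may proceed."""
    return first_writes * LINE, first_writes


def differential(work: bytearray, ckpt: bytearray, dirty_pages: set[int]) -> tuple[int, int]:
    """Within dirty pages, copy only blocks whose contents differ from the
    checkpoint; assume one fence after flushing them and one after a commit flag."""
    written = 0
    for p in sorted(dirty_pages):
        for lo in range(p * PAGE, (p + 1) * PAGE, BLOCK):
            if work[lo:lo + BLOCK] != ckpt[lo:lo + BLOCK]:
                ckpt[lo:lo + BLOCK] = work[lo:lo + BLOCK]
                written += BLOCK
    return written, 2


# Example epoch: 4 pages of state; 8 consecutive lines touched in page 0
# and a single byte touched in page 2 (9 first-written lines in total).
work = bytearray(4 * PAGE)
ckpt_a, ckpt_b = bytearray(work), bytearray(work)
for line in range(8):
    work[line * LINE] = 1
work[2 * PAGE + 3000] = 2
dirty_pages = {0, 2}

pi = page_incremental(work, ckpt_a, dirty_pages)
ul = line_undo_log(first_writes=9)
df = differential(work, ckpt_b, dirty_pages)
print("page-incremental (bytes, fences):", pi)  # (8192, 1)
print("line undo-log    (bytes, fences):", ul)  # (576, 9)
print("differential     (bytes, fences):", df)  # (768, 2)
```

In this toy run, the differential scheme writes 768 bytes instead of 8192 and issues 2 fences instead of 9. It still writes somewhat more bytes than the undo log, a trade-off that a real design would presumably tune through its choice of block size.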
What problem does this paper attempt to address?