Abstract: Scientific applications use checkpointing for failure recovery. Existing checkpointing approaches were designed to store the persistent state of an application as checkpoints in disk-based file systems via the block interface. As non-volatile main memory (NVMM) is introduced into high-performance computing systems, storing checkpoints in NVMM-based file systems can waste much of the performance benefit of NVMM, because doing so under-utilizes memory resources and does not take advantage of the byte-addressability of NVMM. In this paper, we propose an NVMM-aware checkpointing approach named NV-Checkpoint. It uses a compiler-aided technique to automatically generate multi-version data structures, which consist of both a persistent version of the data stored in NVMM for failure recovery and an ephemeral version of the data placed across DRAM and NVMM. Because NVMM is byte-addressable, any version can be accessed via the memory interface. To reduce data redundancy, the versions may share data that are not mutated during the program's execution. NV-Checkpoint provides the same failure-recovery guarantee as conventional checkpointing approaches proposed for file systems. Furthermore, its runtime system manages the layout of the data structures to reduce the number of writes to NVMM, and it uses machine learning models to manage the checkpointing frequency and thereby reduce persistence overhead. Our experimental results with real-world scientific applications show that the performance of annotated programs with NV-Checkpoint using a hybrid of DRAM and NVMM matches the performance of best-effort hand-written versions. NV-Checkpoint achieves scalability similar to that of programs with ephemeral data structures using only DRAM. It offers up to a 121X speedup in execution time compared to conventional checkpointing approaches using the Atlas parallel file system on the Titan supercomputer.

Determination of Checkpointing Intervals for Malleable Applications

Designing an Adaptive Application-Level Checkpoint Management System for Malleable MPI Applications

Optimal Multi-Level Interval-based Checkpointing for Exascale Stream Processing Systems

Selection of a checkpoint interval in a critical-task environment

Time-sharing Parallel Applications Through Performance-Targeted Feedback-Controlled Real-Time Scheduling

Realizing Best Checkpointing Control in Computing Systems

A Utilization Model for Optimization of Checkpoint Intervals in Distributed Stream Processing Systems

Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors

Checkpointing of parallel applications through differential memory functions

Research on Optimal Checkpointing-Interval for Flink Stream Processing Applications

Application Checkpoint and Power Study on Large Scale Systems

Optimal Checkpoint Interval with Availability as an Objective Function

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

Resource allocation and aging priority-based scheduling of linear workflow applications with transient failures and selective imprecise computations

Hybrid Full/incremental Checkpoint/restart for MPI Jobs in HPC Environments

Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

JASS: A Flexible Checkpointing System for NVM-based Systems

Compiler aided checkpointing using crash-consistent data structures in NVMM systems

Employing Checkpoint to Improve Job Scheduling in Large-Scale Systems

Efficient N-to-M Checkpointing Algorithm for Finite Element Simulations