Abstract: Scientific applications use checkpointing for failure recovery. Existing checkpointing approaches store the persistent state of an application as checkpoints in disk-based file systems via the block interface. As non-volatile main memory (NVMM) will be included in high-performance computing systems, storing checkpoints in NVMM-based file systems can waste much of the performance benefit of NVMM, because doing so under-utilizes memory resources and does not take advantage of the byte-addressability of NVMM. In this paper, we propose an NVMM-aware checkpointing approach named NV-Checkpoint. It uses a compiler-aided technique to automatically generate multi-version data structures, which consist of both a persistent version of the data stored in NVMM for failure recovery and an ephemeral version of the data placed across DRAM and NVMM. Because NVMM is byte-addressable, any version can be accessed via the memory interface. To reduce data redundancy, the versions may share data that are not mutated during the program's execution. NV-Checkpoint provides the same failure-recovery guarantee as the conventional checkpointing approaches proposed for file systems. Furthermore, its runtime system manages the layout of the data structures to reduce the number of writes to NVMM, and it uses machine learning models to manage the checkpointing frequency and thereby reduce persistence overhead. Our experimental results with real-world scientific applications show that annotated programs using NV-Checkpoint with a hybrid of DRAM and NVMM match the performance of best-effort hand-written versions and achieve scalability similar to that of versions with ephemeral data structures using only DRAM. NV-Checkpoint offers up to a 121X speedup in execution time compared with the conventional checkpointing approaches using the Atlas parallel file system on the Titan supercomputer.
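
The multi-version idea in the abstract (a persistent version for recovery, an ephemeral version for ongoing computation, and sharing of unmutated data between them) can be illustrated with a small copy-on-write sketch. This is not the NV-Checkpoint implementation: the class `MultiVersionArray`, its methods, and the block granularity are hypothetical, ordinary Python objects stand in for NVMM and DRAM, and the compiler support, layout management, and learned checkpoint frequency described in the abstract are not modeled.

```python
class MultiVersionArray:
    """Toy block-based array with a persistent and an ephemeral version.

    Hypothetical illustration only: immutable tuples stand in for data
    persisted in NVMM, mutable lists stand in for ephemeral working copies.
    """

    def __init__(self, values, block_size=4):
        self.block_size = block_size
        # Persistent version: one immutable tuple per block.
        self.persistent = [tuple(values[i:i + block_size])
                           for i in range(0, len(values), block_size)]
        # Ephemeral version: initially shares every block with the
        # persistent version, so no data is duplicated.
        self.ephemeral = list(self.persistent)
        self.dirty = set()  # indices of blocks mutated since the last checkpoint

    def read(self, index):
        block, offset = divmod(index, self.block_size)
        return self.ephemeral[block][offset]

    def write(self, index, value):
        block, offset = divmod(index, self.block_size)
        if block not in self.dirty:
            # Copy-on-write: only the block being mutated gets a private copy.
            self.ephemeral[block] = list(self.ephemeral[block])
            self.dirty.add(block)
        self.ephemeral[block][offset] = value

    def checkpoint(self):
        # Persist only the mutated blocks, then share them again.
        for block in self.dirty:
            self.persistent[block] = tuple(self.ephemeral[block])
            self.ephemeral[block] = self.persistent[block]
        self.dirty.clear()

    def recover(self):
        # Simulated failure: discard ephemeral changes and fall back
        # to the last persistent version.
        self.ephemeral = list(self.persistent)
        self.dirty.clear()

    def shared_blocks(self):
        # Number of blocks that both versions reference without copying.
        return sum(1 for p, e in zip(self.persistent, self.ephemeral) if p is e)
```

In this sketch, a write touches only the block that contains the mutated element, so unmutated blocks stay shared between the two versions, and `checkpoint()` persists only the dirty blocks. Calling `recover()` after an uncheckpointed write restores the value from the last checkpoint.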

Programming Support and Adaptive Checkpointing for High-Throughput Data Services with Log-Based Recovery

Main Memory Database Recovery Method Based on Shadow Paging and Hybrid Logging

Towards High Performance And High Availability Clusters Of Archived Stream

Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems

Progressive online aggregation in a distributed stream system

Adaptive Lazy Compaction with High Stability and Low Latency for Data-Intensive Systems

Dynamic Adaptive Checkpoint Mechanism for Streaming Applications Based on Reinforcement Learning

High Performance Data Persistence in Non-Volatile Memory for Resilient High Performance Computing

In Search of a Key Value Store with High Performance and High Availability

Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed In-memory Databases

On the Design of Reliable Heterogeneous Systems via Checkpoint Placement and Core Assignment

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment

Adaptive Logging for Distributed In-memory Databases

A Study on the Method of Adaptive Time Intervals Checkpointing

User-level Checkpoint and Recovery for LAM/MPI

A Massive Data Storage and Management Strategy for Online Computer-Assisted Audit System

LogBase: A Scalable Log-structured Database System in the Cloud

Compiler aided checkpointing using crash-consistent data structures in NVMM systems

Protecting Synchronization Mechanisms of Parallel Big Data Kernels via Logging

ActiveSLA: a profit-oriented admission control framework for database-as-a-service providers