Abstract: This paper presents an in-depth examination of checkpoint-restart mechanisms in High-Performance Computing (HPC). It focuses on the use of Distributed MultiThreaded CheckPointing (DMTCP) in various computational settings, including both within and outside of containers. The study is grounded in real-world applications running on NERSC Perlmutter, a state-of-the-art supercomputing system. We discuss the advantages of checkpoint-restart (C/R) in managing complex and lengthy computations in HPC, highlighting its efficiency and reliability in such environments. The role of DMTCP in enhancing these workflows, especially in multi-threaded and distributed applications, is thoroughly explored. Additionally, the paper delves into the use of HPC containers, such as Shifter and Podman-HPC, which aid in the management of computational tasks, ensuring uniform performance across different environments. The methods, results, and potential future directions of this research, including its application in various scientific domains, are also covered, showcasing the critical advancements made in computational methodologies through this study.

What problem does this paper attempt to address?

The paper aims to explore and optimize the Checkpoint-Restart (C/R) mechanism in High-Performance Computing (HPC) environments, with a focus on how effectively Distributed MultiThreaded CheckPointing (DMTCP) works both inside and outside containers. Specifically, the core objectives of the paper are:

1. **Improving the efficiency and reliability of managing complex computational tasks**: The state of running processes is saved periodically and restored after an interruption, so that a computation can continue from its last saved state instead of wasting resources by starting from scratch.
2. **Enhancing the flexibility and robustness of HPC workflows**: DMTCP is used to make job scheduling more flexible for multi-threaded and distributed applications, reduce restart times, and improve the overall resilience of the system.
3. **Exploring the application of HPC containers (such as Shifter and Podman-HPC)**: HPC container technology simplifies software dependency management, helps ensure consistent performance across different environments, and supports more efficient task migration and recovery.
4. **Evaluating the performance of DMTCP in different computing environments**: Single-threaded and multi-threaded Geant4 simulations are tested on the NERSC Perlmutter supercomputer to assess how DMTCP performs inside versus outside containers and how much it actually improves computational efficiency.
5. **Developing methods for automated management and submission of C/R tasks**: An automated script system based on DMTCP and Slurm detects signals, triggers checkpoints, and requeues jobs, so that a task cycles through run, checkpoint, and restart without manual intervention.

Through this research, the paper aims to provide scientists in the HPC field with a more efficient, reliable, and user-friendly way to handle large-scale computational tasks.
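The signal, checkpoint, and requeue cycle described in the objectives can be illustrated with a small Python sketch. This is not the paper's script system and it does not use DMTCP: DMTCP checkpoints a process transparently, whereas the toy job below saves its own state to a JSON file as an application-level stand-in. The class `CheckpointedJob`, the `signal_at` parameter, and the use of `SIGUSR1` as the warning signal are all assumptions made for illustration, and the return value `"requeue"` merely stands in for whatever the real scripts do to resubmit the job through Slurm.

```python
import json
import os
import signal
import tempfile


class CheckpointedJob:
    """Hypothetical toy job: sums 1..total_steps and can resume from a file."""

    def __init__(self, ckpt_path, total_steps):
        self.ckpt_path = ckpt_path
        self.total_steps = total_steps
        self.stop_requested = False
        self.state = {"step": 0, "acc": 0}

    def install_handler(self):
        # In a real job the scheduler's warning signal would be caught here.
        # SIGUSR1 is an assumption; it must be registered from the main thread.
        signal.signal(signal.SIGUSR1, self.on_signal)

    def on_signal(self, signum, frame):
        # Only set a flag; the main loop checkpoints at a safe point.
        self.stop_requested = True

    def restore(self):
        # Resume from the last checkpoint if one exists.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                self.state = json.load(f)

    def checkpoint(self):
        # Write to a temporary file and rename it, so an interruption
        # during the write never leaves a partial checkpoint behind.
        tmp_path = self.ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp_path, self.ckpt_path)

    def run(self, signal_at=None):
        """Run until done, or until a stop is requested.

        signal_at simulates the warning signal arriving once that many
        steps have completed, by calling the handler directly.
        """
        self.restore()
        while self.state["step"] < self.total_steps:
            if signal_at is not None and self.state["step"] == signal_at:
                self.on_signal(signal.SIGUSR1, None)
            if self.stop_requested:
                self.checkpoint()
                return "requeue"  # stand-in for resubmitting the job
            self.state["step"] += 1
            self.state["acc"] += self.state["step"]
        return "done"


def demo():
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "job.ckpt.json")
        # First allocation: interrupted after 4 of 10 steps.
        first = CheckpointedJob(path, total_steps=10).run(signal_at=4)
        with open(path) as f:
            saved = json.load(f)
        # Second allocation: a fresh process picks up the saved state.
        second_job = CheckpointedJob(path, total_steps=10)
        second = second_job.run()
        return first, saved, second, second_job.state
```

Calling `demo()` runs the job twice against the same checkpoint file: the first run stops early and asks to be requeued, and the second run, a fresh object standing in for a fresh process, finishes the remaining steps without repeating the completed ones. The handler only sets a flag so that the checkpoint is written at a well-defined point in the loop rather than in the middle of a step.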

Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC
