Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC

Madan Timalsina, Lisa Gerhardt, Nicholas Tyler, Johannes P. Blaschke, William Arndt
2024-07-27
Abstract: This paper presents an in-depth examination of checkpoint-restart mechanisms in High-Performance Computing (HPC). It focuses on the use of Distributed MultiThreaded CheckPointing (DMTCP) in various computational settings, including both within and outside of containers. The study is grounded in real-world applications running on NERSC Perlmutter, a state-of-the-art supercomputing system. We discuss the advantages of checkpoint-restart (C/R) in managing complex and lengthy computations in HPC, highlighting its efficiency and reliability in such environments. The role of DMTCP in enhancing these workflows, especially in multi-threaded and distributed applications, is thoroughly explored. Additionally, the paper delves into the use of HPC containers, such as Shifter and Podman-HPC, which aid in the management of computational tasks, ensuring uniform performance across different environments. The methods, results, and potential future directions of this research, including its application in various scientific domains, are also covered, showcasing the critical advancements made in computational methodologies through this study.
Distributed, Parallel, and Cluster Computing; Software Engineering
What problem does this paper attempt to address?
The paper explores and optimizes the checkpoint-restart (C/R) mechanism in High-Performance Computing (HPC) environments, focusing on how effectively Distributed MultiThreaded CheckPointing (DMTCP) works both inside and outside containers. Its core objectives are:

1. **Improving the efficiency and reliability of managing complex computational tasks**: Periodically saving the state of running processes and restoring from the saved state after an interruption lets a computation continue without the resource waste of starting from scratch.
2. **Enhancing the flexibility and robustness of HPC workflows**: DMTCP is used to improve job scheduling flexibility for multi-threaded and distributed applications, reduce restart times, and enhance the overall resilience of the system.
3. **Exploring the application of HPC containers (such as Shifter and Podman-HPC)**: Container technology simplifies software dependency management, ensures consistent performance across different environments, and supports more efficient task migration and recovery.
4. **Evaluating the performance of DMTCP in different computing environments**: Single-threaded and multi-threaded Geant4 simulations are run on the NERSC Perlmutter supercomputer to assess how DMTCP performs inside and outside containers and how much it actually improves computational efficiency.
5. **Developing methods for automated management and submission of C/R tasks**: An automated script system based on DMTCP and Slurm detects signals, triggers checkpoints, and requeues jobs, so that a job cycles through checkpoint and restart without manual intervention.

Through this work, the paper aims to give HPC scientists a more efficient, reliable, and user-friendly way to handle large-scale computational tasks.
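
The signal, checkpoint, and requeue cycle described in the fifth objective can be illustrated with a minimal Python sketch. This is not the authors' script system; it only shows the general pattern under several assumptions: the application was started under DMTCP so that `dmtcp_command` can reach its coordinator, the job runs under Slurm so that `SLURM_JOB_ID` is set, and the warning signal is `SIGUSR1`. The helper names (`checkpoint_and_requeue_commands`, `make_handler`, `install_handler`) are hypothetical and chosen for illustration.

```python
import os
import signal
import subprocess
from typing import Callable, List, Sequence

# A runner executes one command. It is injectable so the control flow can be
# exercised on a machine without DMTCP or Slurm installed.
Runner = Callable[[Sequence[str]], object]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def checkpoint_and_requeue_commands(job_id: str) -> List[List[str]]:
    """Return the commands to run, in order, when the warning signal arrives."""
    return [
        # Ask the DMTCP coordinator for a checkpoint and block until it is
        # written. Assumes the coordinator address is available to
        # dmtcp_command (for example through its environment variables).
        ["dmtcp_command", "--bcheckpoint"],
        # Put the Slurm job back in the queue so a later run can restart
        # from the checkpoint image.
        ["scontrol", "requeue", job_id],
    ]


def make_handler(job_id: str, runner: Runner = run_checked):
    """Build a signal handler that checkpoints first and requeues second."""
    def handler(signum, frame):
        for cmd in checkpoint_and_requeue_commands(job_id):
            runner(cmd)
    return handler


def install_handler(sig: int = signal.SIGUSR1,
                    runner: Runner = run_checked) -> None:
    """Register the handler for `sig` in the current process."""
    job_id = os.environ["SLURM_JOB_ID"]  # set by Slurm inside a job
    signal.signal(sig, make_handler(job_id, runner))
```

In a batch job, one way to deliver the warning signal is Slurm's `--signal` option (for example `#SBATCH --signal=B:USR1@60`, which asks Slurm to signal the batch shell about 60 seconds before the time limit), combined with `#SBATCH --requeue`. The 60-second lead time is an arbitrary illustrative value; it would need to exceed the time the application takes to write its checkpoint.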