AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems.

Jie Jia, Yi Liu, Yanke Liu, Yifan Chen, Fang Lin
DOI: https://doi.org/10.1007/978-3-031-69583-4_24
2024-01-01
Abstract: As high-performance computing (HPC) systems scale up, resilience has become an important challenge. Checkpointing, a widely used resilience technique for HPC systems, saves checkpoints of the system during the execution of parallel programs and, in case of failure, resumes execution from the most recent checkpoint. However, large-scale parallel programs often spawn thousands of processes, which results in large volumes of simultaneous data writes at each checkpoint and strains the storage and parallel file systems of the HPC system. To tackle this problem, this paper proposes AdapCK, an I/O optimization scheme for checkpointing on large-scale HPC systems. AdapCK consists of two main parts: a load-balancing mechanism that balances workloads across low-level storage volumes during checkpointing, and a throughput-aware checkpoint-data writing mechanism that reduces I/O contention and increases utilization of I/O bandwidth. Experimental results show that AdapCK reduces checkpoint time by more than 30%, and by up to 54.5%.
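The abstract does not describe how AdapCK balances load, so the following is only a minimal sketch of the general idea of spreading checkpoint writes across storage volumes, not the paper's algorithm. It uses a generic greedy heuristic (largest checkpoint first, always onto the least-loaded volume). The function name `assign_to_volumes`, the per-rank size dictionary, and the integer volume ids are all hypothetical and chosen for illustration.

```python
import heapq


def assign_to_volumes(sizes, num_volumes):
    """Greedy placement of per-process checkpoint data onto storage volumes.

    sizes: dict mapping a process rank to its checkpoint size (hypothetical input).
    Returns (placement, loads): rank -> volume id, and total bytes per volume.
    This is a generic heuristic, assumed for illustration only.
    """
    # Min-heap of (bytes queued so far, volume id); all volumes start empty.
    heap = [(0, vol) for vol in range(num_volumes)]
    heapq.heapify(heap)

    placement = {}
    # Place the largest checkpoints first so small ones can even out the tail.
    for rank in sorted(sizes, key=sizes.get, reverse=True):
        load, vol = heapq.heappop(heap)  # least-loaded volume right now
        placement[rank] = vol
        heapq.heappush(heap, (load + sizes[rank], vol))

    # Recompute the per-volume totals from the final placement.
    loads = [0] * num_volumes
    for rank, vol in placement.items():
        loads[vol] += sizes[rank]
    return placement, loads
```

Calling `assign_to_volumes({0: 8, 1: 7, 2: 6, 3: 5, 4: 4}, 2)` places ranks 0, 3, and 4 on one volume and ranks 1 and 2 on the other. A real scheme would also need to account for the throughput each volume actually delivers, which the abstract attributes to AdapCK's throughput-aware writing mechanism.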