Abstract: Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While DL frameworks have put significant effort into facilitating distributed training, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "A Study of Checkpointing in Large Scale Training of Deep Neural Networks" addresses fault tolerance for deep learning (DL) frameworks that run large-scale distributed training on high-performance computing (HPC) systems. Specifically, the paper focuses on the following aspects:

1. **Implementation of checkpoint mechanisms**:
   - Compares and analyzes the checkpoint mechanisms in currently popular deep learning frameworks such as Chainer, PyTorch, and TensorFlow.
   - Evaluates the checkpoint overheads of different frameworks at different scales, including computational cost, file format, and file size.
2. **Performance impact of checkpoint mechanisms**:
   - Studies the impact of checkpoint mechanisms on the performance of distributed training, especially in large-scale clusters.
   - Analyzes the checkpoint overheads and performance differences of different models (such as ResNet50 and VGG16) under different frameworks.
3. **Deterministic behavior**:
   - Verifies the deterministic behavior of different frameworks after restarting from checkpoints to ensure the repeatability and verifiability of training.
4. **Optimization suggestions**:
   - Provides discussion points for improving existing checkpoint mechanisms, helping users choose fault-tolerant frameworks suitable for HPC environments.
   - Offers improvement suggestions for framework developers to enhance the checkpoint performance of deep learning workloads in HPC systems.

### Main contributions

1. **Explore and compare checkpoint mechanisms of distributed computing DL frameworks**:
   - Explains in detail the checkpoint mechanisms of different frameworks and compares their design decisions.
2. **Measure and evaluate checkpoint overheads at different scales**:
   - Demonstrates the checkpoint overheads of different frameworks at different scales through experimental data, revealing the bottlenecks in existing checkpoint implementations.
3. **Study the deterministic behavior of DNN training**:
   - Verifies the deterministic behavior of different frameworks after restarting from checkpoints through experiments, ensuring the repeatability and verifiability of training.

### Experimental methods

- **Experimental platforms**:
  - Two advanced HPC systems are used: Marenostrum and ABCI.
  - The Marenostrum system is configured with 52 nodes, each containing 2 IBM Power9 processors and 4 NVIDIA V100 GPUs.
  - The ABCI system is configured with 1088 nodes, each containing 2 Intel Xeon Gold 6148 processors and 4 NVIDIA Tesla V100 GPUs.
- **Experimental settings**:
  - The CIFAR-10 dataset and the ResNet50 and VGG16 models are used for the experiments.
  - The experiments are divided into two groups: one is carried out on Marenostrum, and the other on ABCI.
  - The checkpoint overheads, file sizes, and deterministic behaviors of different frameworks at different scales are evaluated.

### Experimental results

- **Computational cost**:
  - Table I shows the training time and checkpoint overheads of different frameworks at different scales. PyTorch shows the best performance in small-scale experiments, while Chainer performs excellently in large-scale experiments.
  - TensorFlow has the best performance in terms of checkpoint overhead, with an overall average overhead of only 2.3%.
- **File size and format**:
  - Table II shows the checkpoint file sizes and formats of different frameworks under different models. Chainer has the smallest file size, but its performance is greatly affected.
  - The file size of PyTorch under the VGG16 model increases significantly, indicating that its serialization mechanism may not be optimized for different models.
- **Large-scale checkpoints**:
  - Table III shows the results of large-scale experiments carried out on the ABCI system. Chainer performs excellently in large-scale experiments, but has obvious problems in terms of checkpoint overhead.
  - TensorFlow performs well in large-scale experiments and has the lowest checkpoint overhead.
- **Deterministic behavior**:
  - Figures 2 and 3 show the deterministic behaviors of different frameworks after restarting from checkpoints. PyTorch performs better in terms of deterministic behavior, while Chainer and T
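
The idea of deterministic checkpoint-restart discussed in this summary can be illustrated with a minimal sketch. It assumes a toy scalar "model" and uses only Python's standard-library `random` and `pickle`; it is not the checkpoint mechanism of Chainer, PyTorch, or TensorFlow. The function `train`, the flag `restore_rng`, and the checkpoint dictionary keys are hypothetical names chosen for illustration. The point it shows is that a restart reproduces the uninterrupted run only if the checkpoint captures the random-number-generator state along with the model state and step counter.

```python
import pickle
import random


def train(steps, checkpoint=None, restore_rng=True):
    """Toy training loop (illustrative only): returns (weight, checkpoint bytes)."""
    rng = random.Random(0)          # fixed seed, as in a reproducible run
    step, weight = 0, 0.0
    if checkpoint is not None:
        saved = pickle.loads(checkpoint)
        step, weight = saved["step"], saved["weight"]
        if restore_rng:
            # Resume the random stream exactly where the checkpoint left it.
            rng.setstate(saved["rng"])
    while step < steps:
        # Stand-in for a stochastic update (e.g. shuffling or dropout noise).
        weight += 0.1 * (rng.random() - 0.5)
        step += 1
    # Checkpoint = model state + progress counter + RNG state.
    blob = pickle.dumps({"step": step, "weight": weight, "rng": rng.getstate()})
    return weight, blob


reference, _ = train(100)                        # uninterrupted run
_, blob = train(50)                              # run that stops at step 50
resumed, _ = train(100, checkpoint=blob)         # restart with RNG state restored
drifted, _ = train(100, checkpoint=blob, restore_rng=False)  # RNG state ignored

print("resumed matches reference:", resumed == reference)
print("drifted matches reference:", drifted == reference)
```

In this sketch the restart that restores the generator state follows the same trajectory as the uninterrupted run, whereas the restart that reseeds from scratch replays the first 50 random draws and ends at a different weight. Real frameworks have additional sources of nondeterminism (for example GPU kernels and data-loader ordering) that this toy does not model.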

A Study of Checkpointing in Large Scale Training of Deep Neural Networks
