A Study of Checkpointing in Large Scale Training of Deep Neural Networks

Elvis Rojas, Albert Njoroge Kahira, Esteban Meneses, Leonardo Bautista Gomez, Rosa M Badia
DOI: https://doi.org/10.48550/arXiv.2012.00825
2021-03-30
Abstract: Deep learning (DL) applications are increasingly being deployed on HPC systems, to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training by DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "A Study of Checkpointing in Large Scale Training of Deep Neural Networks" addresses fault tolerance for deep learning (DL) frameworks running large-scale distributed training on high-performance computing (HPC) systems. Specifically, it focuses on the following aspects:

1. **Implementation of checkpoint mechanisms**:
   - Compares and analyzes the checkpoint mechanisms of three popular deep learning frameworks: Chainer, PyTorch, and TensorFlow.
   - Evaluates the checkpoint overheads of the frameworks at different scales, including computational cost, file format, and file size.
2. **Performance impact of checkpoint mechanisms**:
   - Studies the impact of checkpointing on the performance of distributed training, especially on large-scale clusters.
   - Analyzes the checkpoint overheads and performance differences of different models (such as ResNet50 and VGG16) under each framework.
3. **Deterministic behavior**:
   - Verifies whether each framework behaves deterministically after restarting from a checkpoint, to ensure the repeatability and verifiability of training.
4. **Optimization suggestions**:
   - Provides discussion points on improving existing checkpoint mechanisms and on helping users choose a fault-tolerant framework suited to HPC environments.
   - Offers suggestions that framework developers can use to improve the checkpointing performance of deep learning workloads on HPC systems.

### Main contributions

1. **Explore and compare checkpoint mechanisms of distributed computing DL frameworks**:
   - Explains in detail the checkpoint mechanisms of the different frameworks and compares their design decisions.
2. **Measure and evaluate checkpoint overheads at different scales**:
   - Reports experimental data on the checkpoint overheads of the frameworks at different scales, revealing bottlenecks in existing checkpoint implementations.
3. **Study the deterministic behavior of DNN training**:
   - Verifies experimentally whether each framework behaves deterministically after restarting from a checkpoint, to ensure the repeatability and verifiability of training.

### Experimental methods

- **Experimental platforms**:
  - Two HPC systems are used: Marenostrum and ABCI.
  - The Marenostrum system is configured with 52 nodes, each containing 2 IBM Power9 processors and 4 NVIDIA V100 GPUs.
  - The ABCI system is configured with 1088 nodes, each containing 2 Intel Xeon Gold 6148 processors and 4 NVIDIA Tesla V100 GPUs.
- **Experimental settings**:
  - The CIFAR-10 dataset and the ResNet50 and VGG16 models are used for the experiments.
  - The experiments are divided into two groups: one carried out on Marenostrum and the other on ABCI.
  - The checkpoint overheads, file sizes, and deterministic behavior of the frameworks are evaluated at different scales.

### Experimental results

- **Computational cost**:
  - Table I shows the training time and checkpoint overheads of the frameworks at different scales. PyTorch shows the best performance in small-scale experiments, while Chainer performs best in large-scale experiments.
  - TensorFlow has the lowest checkpoint overhead, with an overall average of only 2.3%.
- **File size and format**:
  - Table II shows the checkpoint file sizes and formats of the frameworks for each model. Chainer produces the smallest files, but at a significant cost in performance.
  - The PyTorch checkpoint file grows significantly for the VGG16 model, indicating that its serialization mechanism may not be optimized for different models.
- **Large-scale checkpoints**:
  - Table III shows the results of the large-scale experiments carried out on the ABCI system. Chainer trains fastest at large scale, but has an obvious problem with checkpoint overhead.
  - TensorFlow performs well in large-scale experiments and has the lowest checkpoint overhead.
- **Deterministic behavior**:
  - Figures 2 and 3 show the deterministic behavior of the frameworks after restarting from checkpoints. PyTorch performs better in terms of deterministic behavior, while Chainer and T