Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu

2024-08-20

Abstract: To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition between checkpointing and training, which is tolerable under DP but scales poorly under resource-intensive HP. To ensure low checkpointing overhead for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It mitigates the on-host resource competition caused by in-memory checkpointing in two ways: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This approach uses three-level asynchronous on-device scheduling to enhance parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to enhance checkpoint completeness during hardware failures. Unlike methods that require inter-node communications, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. With these methods, this work enables fast restart for failed HP training with Distributed In-memory Checkpoint Loading, bypassing inefficiencies in NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).
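The idea of overlapping snapshotting with training can be illustrated with a minimal Python sketch. It is not the paper's implementation: it shows only a single level of asynchrony (a background thread that copies parameters in chunks), not the three-level scheduling the abstract mentions. The names `start_snapshot`, `train_step`, and `chunk_size` are hypothetical, and plain Python lists stand in for device tensors.

```python
import threading


def start_snapshot(params, chunk_size=2):
    """Start copying `params` into a host-side dict on a background thread.

    Hypothetical helper: `params` maps names to lists of floats.
    """
    snapshot = {}

    def copy_in_chunks():
        for name, values in params.items():
            copied = []
            # Copy a small chunk at a time so the copy can interleave
            # with other work instead of running as one long transfer.
            for start in range(0, len(values), chunk_size):
                copied.extend(values[start:start + chunk_size])
            snapshot[name] = copied

    worker = threading.Thread(target=copy_in_chunks)
    worker.start()
    return worker, snapshot


def train_step(params, grads, lr=0.5):
    """One simulated step: snapshot overlaps compute, then weights update."""
    worker, snapshot = start_snapshot(params)
    # Forward and backward passes would run here; they only read `params`,
    # so they can safely overlap with the snapshot thread.
    worker.join()  # the snapshot must finish before parameters are overwritten
    for name in params:
        params[name] = [w - lr * g for w, g in zip(params[name], grads[name])]
    return snapshot
```

The `join()` before the update is the key design choice in this sketch: the snapshot stays consistent because parameters are modified only after the copy has completed.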

Distributed, Parallel, and Cluster Computing; Performance

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the issue of fault recovery during large-scale model (LM) training, particularly in the context of Hybrid Parallelism (HP) training. Specifically:

1. **Limitations of Existing Methods**:
   - Existing asynchronous and memory checkpointing methods are primarily designed for data parallel (DP) training or small-scale models, and they exhibit limitations when applied to large-scale models and hybrid parallel training.
   - In hybrid parallel training, existing memory checkpointing methods lead to severe resource contention issues, especially during parameter snapshotting and protection processes.
2. **Proposed New System**:
   - To overcome these limitations, the paper proposes a memory checkpointing system named **REFT**, which efficiently utilizes volatile host memory to protect snapshots, enabling rapid recovery.
   - REFT comprises two main components: REFT-save for memory checkpoint saving and REFT-load for memory checkpoint loading.
3. **Key Optimizations**:
   - **Efficient Snapshotting**: Introduces Hierarchical Asynchronous Snapshotting (HAS), which effectively utilizes device idle time to reduce resource contention in hybrid parallel training.
   - **Efficient and Reliable In-Memory Protecting**: Enhances the integrity of distributed checkpoints through methods such as Asynchronous Redundant Copying (ARC), Asynchronous Error Correction (AEC), and Asynchronous Optimizer Recalculation (AOR), with minimal additional overhead during snapshot redundancy.

Through these methods, REFT significantly reduces the overhead of memory checkpointing, ensuring rapid and reliable large-scale model training and fault recovery in hybrid parallel training.
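The summary names Asynchronous Error Correction but does not describe its mechanism. As a rough illustration of how redundancy can protect snapshot shards, the sketch below assumes a simple single-parity XOR scheme over equal-length shards. This is an assumption for illustration, not a description of REFT's actual method, and the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """XOR two byte strings of equal length."""
    if len(a) != len(b):
        raise ValueError("shards must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards):
    """Combine all shards into one parity shard."""
    return reduce(xor_bytes, shards)


def recover_shard(surviving, parity):
    """Rebuild the single missing shard from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


# Usage: three snapshot shards, one of which is lost.
shards = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(shards)
rebuilt = recover_shard([shards[0], shards[2]], parity)
```

A single parity shard tolerates the loss of any one shard at the storage cost of one extra shard, which is cheaper than keeping a full redundant copy of every shard.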

Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

Improving Bank-Level Parallelism for In-Memory Checkpointing in Hybrid Memory Systems

BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

A Study of Checkpointing in Large Scale Training of Deep Neural Networks

Self-Checkpoint

FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training

A Memory-efficient Hybrid Parallel Framework for Deep Neural Network Training

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations With Machine Learning

Convergence-aware optimal checkpointing for exploratory deep learning training jobs

An Efficient Checkpoint Strategy for Federated Learning on Heterogeneous Fault-Prone Nodes

A Multilevel Fault-Tolerance Technique for the DAG Data Driven Model

Hybrid Full/incremental Checkpoint/restart for MPI Jobs in HPC Environments

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems

Accelerating the Training of Large Language Models Using Efficient Activation Rematerialization and Optimal Hybrid Parallelism

Hybrid CPU/GPU Checkpoint for GPU-Based Heterogeneous Systems

HetHub: A Heterogeneous Distributed Hybrid Training System for Large-Scale Models