Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu
2024-08-20
Abstract: To scale large model (LM) training efficiently, researchers are moving from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods cause severe resource competition between checkpointing and training, which is tolerable under DP but scales poorly under resource-intensive HP. To keep checkpointing overhead low for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It mitigates the on-host resource competition caused by in-memory checkpointing in two ways: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This approach uses three-level asynchronous on-device scheduling to increase parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to improve checkpoint completeness during hardware failures. Unlike methods that require inter-node communication, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. With these methods, this work enables fast restart of failed HP training through Distributed In-memory Checkpoint Loading, bypassing inefficient NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).
Distributed, Parallel, and Cluster Computing; Performance
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses fault recovery during large-scale model (LM) training, particularly under hybrid parallelism (HP). Specifically:

1. **Limitations of Existing Methods**:
   - Existing asynchronous and in-memory checkpointing methods are designed primarily for data-parallel (DP) training or small-scale models, and they fall short when applied to large-scale models and hybrid-parallel training.
   - In hybrid-parallel training, existing in-memory checkpointing methods cause severe resource contention, especially while snapshotting and protecting parameters.

2. **Proposed New System**:
   - To overcome these limitations, the paper proposes an in-memory checkpointing system named **REFT**, which efficiently uses volatile host memory to protect snapshots, enabling rapid recovery.
   - REFT comprises two main components: REFT-save for in-memory checkpoint saving and REFT-load for in-memory checkpoint loading.

3. **Key Optimizations**:
   - **Efficient Snapshotting**: Introduces Hierarchical Asynchronous Snapshotting (HAS), which uses device idle time to reduce resource contention in hybrid-parallel training.
   - **Efficient and Reliable In-Memory Protection**: Improves the integrity of distributed checkpoints through Asynchronous Redundant Copying (ARC), Asynchronous Error Correction (AEC), and Asynchronous Optimizer Recalculation (AOR), adding minimal overhead when creating snapshot redundancy.

Through these methods, REFT significantly reduces the overhead of in-memory checkpointing, enabling rapid and reliable fault recovery for large-scale model training under hybrid parallelism.
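
To make the asynchronous snapshotting idea more concrete, here is a minimal Python sketch. It is not REFT's implementation and does not reproduce HAS's three-level scheduling. It only illustrates the general pattern under one assumption: parameters are read-only during the forward and backward passes, so a background thread can copy them to a host-side buffer in that window, and training only has to wait for the copy just before the optimizer update. The names `AsyncSnapshotter` and `train` are hypothetical, and plain Python lists stand in for device tensors.

```python
import threading


class AsyncSnapshotter:
    """Copies parameter shards to a host-side buffer on a background thread.

    Hypothetical helper for illustration only; not part of REFT.
    """

    def __init__(self):
        self._thread = None
        self.snapshot = None  # last completed host-side copy
        self.step = None      # number of updates reflected in `snapshot`

    def start(self, params, step):
        """Begin copying `params`; the caller must not mutate them until wait()."""
        self.wait()

        def _copy():
            staged = [list(shard) for shard in params]  # shard-by-shard copy
            self.snapshot, self.step = staged, step

        self._thread = threading.Thread(target=_copy)
        self._thread.start()

    def wait(self):
        """Block until any in-flight copy has finished."""
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def train(num_steps, snapshot_every, snapshotter):
    # Two toy "shards" of parameters standing in for device tensors.
    params = [[0.0] * 4 for _ in range(2)]
    for step in range(num_steps):
        if step > 0 and step % snapshot_every == 0:
            # Parameters currently reflect `step` completed updates.
            snapshotter.start(params, step)
        # Forward/backward: only reads params, so it can overlap with the copy.
        grads = [[1.0 for _ in shard] for shard in params]
        # The optimizer update mutates params, so the copy must finish first.
        snapshotter.wait()
        for shard, grad in zip(params, grads):
            for i, g in enumerate(grad):
                shard[i] += g
    return params


snapshotter = AsyncSnapshotter()
final_params = train(num_steps=5, snapshot_every=2, snapshotter=snapshotter)
```

In this toy run the last snapshot reflects four updates while training has completed five, so a restart from memory would resume at step 4. A real system would typically replace the list copies with device-to-host transfers into preallocated host buffers and would have to budget that copy against the other traffic competing for the same links.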