Abstract: To scale large model (LM) training efficiently, researchers are moving from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition between checkpointing and training; this is tolerable under DP but scales poorly under resource-intensive HP. To keep checkpointing overhead low for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It mitigates the on-host resource competition caused by in-memory checkpointing in two ways: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This approach uses three-level asynchronous on-device scheduling to increase parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to improve checkpoint completeness during hardware failures. Unlike methods that require inter-node communication, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. With these methods, this work enables fast restart of failed HP training through Distributed In-memory Checkpoint Loading, bypassing the inefficiency of NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).
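
The general idea of overlapping snapshotting with training can be illustrated with a minimal sketch. This is not the paper's three-level scheduling; it is a single-process stand-in under stated assumptions: a background thread and `copy.deepcopy` take the place of an asynchronous copy into host memory, and the names `AsyncSnapshotter` and `train` are hypothetical. The copy runs while the next step's gradient computation (which only reads the parameters) proceeds, and the loop waits for the copy to finish before it modifies the parameters again.

```python
import copy
import threading


class AsyncSnapshotter:
    """Hypothetical helper: copies training state in a background thread."""

    def __init__(self):
        self._thread = None
        self._latest = None  # (step, copied state) of the last finished snapshot
        self._lock = threading.Lock()

    def start(self, step, state):
        self.wait()  # allow at most one snapshot in flight
        self._thread = threading.Thread(target=self._copy, args=(step, state))
        self._thread.start()

    def _copy(self, step, state):
        snapshot = copy.deepcopy(state)  # stand-in for a copy into host memory
        with self._lock:
            self._latest = (step, snapshot)

    def wait(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def latest(self):
        with self._lock:
            return self._latest


def train(num_steps, snapshot_every, lr=0.5):
    params = {"w": [0.0, 0.0, 0.0]}
    snapshotter = AsyncSnapshotter()
    for step in range(1, num_steps + 1):
        # Stand-in for forward/backward: reads params only, so it can
        # overlap with a snapshot that is still being copied.
        grads = [1.0 for _ in params["w"]]
        # The parameters must not change while they are being copied.
        snapshotter.wait()
        params["w"] = [w - lr * g for w, g in zip(params["w"], grads)]
        if step % snapshot_every == 0:
            snapshotter.start(step, params)
    snapshotter.wait()
    return params, snapshotter.latest()


if __name__ == "__main__":
    final_params, (snap_step, snap_state) = train(num_steps=5, snapshot_every=2)
    print("final:", final_params)
    print("latest snapshot: step", snap_step, snap_state)
```

After a simulated failure, training would resume from the state returned by `latest()` rather than from a file read back over the network. A real system would additionally have to limit how much the copy competes with training for memory bandwidth and interconnect, which is the resource competition the abstract describes.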

Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

Parallel Transient Stability-Constrained Optimal Power Flow Using GPU as Coprocessor.

Self-Checkpoint

Hybrid Full/incremental Checkpoint/restart for MPI Jobs in HPC Environments

DyCuckoo: Dynamic Hash Tables on GPUs.

Hydrogen: Contention-Aware Hybrid Memory for Heterogeneous CPU-GPU Architectures

Transparent Checkpointing for OpenGL Applications on GPUs

A New Global Consistent Checkpoint Based On Os Virtualization

CRAC: an Automatic Assistant Compiler of Checkpoint/restart for OpenCL Program

GPU Lock-Free Hopscotch Hash Table

A user mode CPU–GPU scheduling framework for hybrid workloads

Transparent Checkpoint-Restart for Hardware-Accelerated 3D Graphics

AutoCheck: Automatically Identifying Variables for Checkpointing by Data Dependency Analysis

Hybrid CPU–GPU execution support in the skeleton programming framework SkePU

Safe and Practical GPU Acceleration in TrustZone

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps