Abstract: Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configurations of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategy and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training such as improved resilience to hardware failures through continued training on remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: distributed representation for saving, and consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments into training ranks of arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.

What problem does this paper attempt to address?

The paper addresses checkpointing in large-scale distributed deep learning training by proposing Universal Checkpointing (UCP). Existing checkpointing methods break down when hardware resources change or fail, and especially when the model parallelism strategy changes, because training cannot easily be resumed under the new setup. UCP aims to make checkpoint creation efficient and checkpoint recovery flexible, so that training can resume under a different parallelism strategy and hardware configuration.

### Problems the Paper Attempts to Solve

1. **Flexibility Issue**: Current distributed training frameworks (such as Megatron-LM and DeepSpeed) offer various parallelism strategies to accelerate the training of large models. However, they assume that GPU resources are statically allocated at the start of training, and they cannot resume training under a different parallelism strategy or hardware configuration. As a result, training is hard to continue efficiently when hardware resources change (e.g., after a hardware failure) or when the number of GPUs needs to be adjusted.
2. **Efficiency Issue**: Existing checkpointing methods are inefficient for large-scale distributed training. Consolidating distributed model state into a single checkpoint significantly slows training and is impractical at extreme scales, while distributed checkpoints are tied to the configuration that produced them and cannot be reused on a different one.

### Solution Overview

UCP addresses these issues through two key mechanisms:

1. **Universal Checkpoint Format**: A unified checkpoint format that contains a consolidated representation of each model parameter, plus metadata for mapping parameter fragments to training ranks under any model-parallelism configuration.
2. **Universal Checkpoint Language**: A simple yet powerful specification language for converting distributed checkpoints into the universal checkpoint format.

Through these mechanisms, UCP achieves efficient and flexible checkpoint creation and recovery without compromising model quality. UCP also improves the resilience of training to hardware failures and reduces training time by exploiting elastic capacity.

### Experimental Results

The paper validates the effectiveness and generality of UCP through a series of experiments. The results show that UCP supports seamless transitions from one parallelism strategy to another while keeping training consistent: training can resume under different hardware configurations and parallelism strategies without affecting model quality. UCP also introduces no additional checkpoint-saving overhead; in the absence of node failures, training with UCP consumes no more GPU hours than conventional distributed training. When node failures occur, UCP's conversion and checkpoint loading add only minimal overhead.

In summary, UCP provides an effective solution to the checkpointing problem in large-scale distributed deep learning training.
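The idea behind the universal checkpoint format (a consolidated representation of each parameter, plus metadata that maps fragments to ranks) can be illustrated with a small sketch. This is not UCP's actual implementation or file format: the names `FragmentMeta`, `consolidate`, `plan_fragments`, and `load_for_rank` are hypothetical, parameters are modeled as flat Python lists rather than tensors, and fragments are assumed to be contiguous 1-D slices ordered by rank.

```python
from dataclasses import dataclass


@dataclass
class FragmentMeta:
    """Hypothetical metadata: the slice of a consolidated parameter owned by one rank."""
    rank: int
    start: int
    stop: int


def consolidate(shards):
    """Concatenate per-rank fragments, in rank order, into one flat parameter."""
    flat = []
    for rank in sorted(shards):
        flat.extend(shards[rank])
    return flat


def plan_fragments(numel, world_size):
    """Split numel elements into world_size contiguous, near-equal slices."""
    base, extra = divmod(numel, world_size)
    plan, start = [], 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional element each.
        stop = start + base + (1 if rank < extra else 0)
        plan.append(FragmentMeta(rank, start, stop))
        start = stop
    return plan


def load_for_rank(consolidated, meta):
    """Extract the fragment a target rank should load."""
    return consolidated[meta.start:meta.stop]


# A parameter saved by a run that sharded it across 2 ranks.
source_shards = {0: [0.0, 1.0, 2.0], 1: [3.0, 4.0, 5.0]}

# Conversion step: build the consolidated representation.
consolidated = consolidate(source_shards)

# Loading step: resume on 3 ranks instead of 2.
target_plan = plan_fragments(len(consolidated), 3)
target_shards = {m.rank: load_for_rank(consolidated, m) for m in target_plan}
```

In this toy setting the saving run never consolidates anything itself; consolidation happens only when the checkpoint is converted, and the target run needs just the consolidated values and its own slice boundaries. Real parameters are multi-dimensional and may be sharded along different axes, which this sketch does not model.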

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
