Abstract: Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallelism configurations. In addition, saved checkpoints are dispatched to evaluation tasks or transferred across different training stages (e.g., from pre-training to post-training). All these scenarios require resharding distributed checkpoints from one parallelism to another. In production, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales. A high-performance checkpointing system is needed to enable efficient checkpoint management at scale. This paper presents ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint employs a parallelism-agnostic checkpoint representation that enables efficient load-time checkpoint resharding. ByteCheckpoint advocates a generic checkpoint saving/loading workflow to accommodate multiple training frameworks and support different storage backends. To ensure high I/O efficiency, we take a full-stack approach to optimize saving/loading plan generation, critical stages of checkpointing pipelines, and irregular tensor processing required by resharding. To guarantee the scalability of ByteCheckpoint in large-scale training, we enhance the storage system to efficiently handle high volumes of checkpointing I/O requests, devise communication optimizations within the checkpointing workflow, and introduce a suite of monitoring tools to analyze performance and detect bottlenecks. Compared to existing open-source checkpointing systems [40, 46], ByteCheckpoint significantly reduces runtime checkpoint stalls, achieving an average reduction of 54.20x. For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the challenges of checkpoint management in the development of Large Foundation Models (LFMs). Specifically, it proposes ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint mainly solves the following problems:

1. **Efficient checkpoint resharding**: Across the stages and tasks of LFM training, checkpoints often need to be resharded because of changes in resource allocation (such as an increase or decrease in the number of GPUs), adjustments to the training configuration (such as changes in context length), and the application of system optimization techniques. Traditional offline resharding methods are time-consuming and costly to maintain. ByteCheckpoint performs resharding automatically at load time, which significantly reduces this overhead and improves the end-to-end effective training time ratio (ETTR).
2. **Support for multiple training frameworks and storage backends**: Different users may choose different training frameworks (such as Megatron-LM, PyTorch FSDP, DDP, etc.) and storage backends (such as local disks, HDFS, NAS, etc.). ByteCheckpoint provides a general workflow that adapts to these frameworks and backends, avoiding the complexity and high maintenance cost of a customized implementation for each framework.
3. **Efficient I/O performance and scalability**: Large-scale LFM training involves a large number of I/O operations, especially when saving and loading distributed checkpoints. ByteCheckpoint ensures efficient I/O performance through full-stack optimization techniques, including balanced and zero-redundancy plan generation, fully asynchronous execution pipelines, and irregular tensor processing. ByteCheckpoint also optimizes the storage system to handle a large number of I/O requests and provides monitoring tools to analyze performance and detect bottlenecks, thereby ensuring the scalability of the system.

### Summary

ByteCheckpoint addresses the complexity and efficiency problems of checkpoint management in large-scale LFM training by providing a unified architecture, a decoupled checkpoint representation, a general workflow, and full-stack optimization techniques. Compared with existing open-source checkpointing systems, ByteCheckpoint shows significant advantages in reducing checkpoint stall time and increasing saving and loading speeds.
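The load-time resharding idea mentioned above can be illustrated with a small sketch. This is not ByteCheckpoint's API or its actual checkpoint format: `ShardMeta`, `even_shards`, `load_plan`, and `reshard` are hypothetical names, and the example assumes a single 1-D tensor split into contiguous shards. The point is only that, once each saved shard records its position in the global tensor rather than its rank under a particular parallelism, a reader under a different parallelism can work out which saved byte ranges it needs by intersecting intervals.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShardMeta:
    """Hypothetical metadata for one saved shard of a 1-D tensor."""
    offset: int  # start index in the global (unsharded) tensor
    length: int  # number of elements stored in this shard


def even_shards(global_len: int, world_size: int) -> List[ShardMeta]:
    """Split [0, global_len) into contiguous, nearly equal shards."""
    base, extra = divmod(global_len, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        shards.append(ShardMeta(start, size))
        start += size
    return shards


def load_plan(saved: List[ShardMeta], wanted: ShardMeta) -> List[Tuple[int, int, int, int]]:
    """Return (saved_idx, src_start, dst_start, count) copies that fill `wanted`."""
    plan = []
    for idx, s in enumerate(saved):
        lo = max(s.offset, wanted.offset)
        hi = min(s.offset + s.length, wanted.offset + wanted.length)
        if lo < hi:  # the saved shard overlaps the requested range
            plan.append((idx, lo - s.offset, lo - wanted.offset, hi - lo))
    return plan


def reshard(saved_meta: List[ShardMeta], saved_data: List[list], new_world_size: int) -> List[list]:
    """Rebuild the shards for a new world size directly from the saved shards."""
    global_len = sum(m.length for m in saved_meta)
    out = []
    for want in even_shards(global_len, new_world_size):
        buf = [None] * want.length
        for idx, src, dst, n in load_plan(saved_meta, want):
            buf[dst:dst + n] = saved_data[idx][src:src + n]
        out.append(buf)
    return out


# Save with 4 ranks, then load with 3 ranks without an offline conversion step.
full = list(range(10))
old_meta = even_shards(len(full), 4)
old_data = [full[m.offset:m.offset + m.length] for m in old_meta]
new_data = reshard(old_meta, old_data, 3)
```

In this toy setting each new rank reads only the slices it needs from the saved shards, so no intermediate full tensor is materialized. Real multi-dimensional and irregularly sharded tensors need more bookkeeping than this 1-D case.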

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
