Abstract: Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), as it allows training to resume after failures or after changes in GPU resources and parallelism configurations. In addition, saved checkpoints are dispatched to evaluation tasks or transferred across different training stages (e.g., from pre-training to post-training). All these scenarios require resharding distributed checkpoints from one parallelism configuration to another. In production, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales. A high-performance checkpointing system is needed to enable efficient checkpoint management at scale. This paper presents ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint employs a parallelism-agnostic checkpoint representation that enables efficient load-time checkpoint resharding. It adopts a generic checkpoint saving/loading workflow that accommodates multiple training frameworks and supports different storage backends. To ensure high I/O efficiency, we take a full-stack approach to optimize saving/loading plan generation, critical stages of checkpointing pipelines, and the irregular tensor processing required by resharding. To guarantee the scalability of ByteCheckpoint in large-scale training, we enhance the storage system to efficiently handle high volumes of checkpointing I/O requests, devise communication optimizations within the checkpointing workflow, and introduce a suite of monitoring tools to analyze performance and detect bottlenecks. Compared to existing open-source checkpointing systems [40, 46], ByteCheckpoint significantly reduces runtime checkpoint stalls, achieving an average reduction of 54.20x. For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.
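To make the idea of load-time resharding concrete, the following is a minimal sketch and not ByteCheckpoint's actual implementation or API. It assumes a flat (1-D) tensor and a hypothetical `ShardMeta` record that stores each saved shard's global offset, which is one simple way a checkpoint could be described independently of the parallelism used to save it. The names `ShardMeta`, `save_sharded`, and `load_resharded` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShardMeta:
    """Hypothetical record: where one saved shard sits in the global tensor."""
    offset: int        # start index in the flattened global tensor
    length: int        # number of elements in this shard
    data: List[float]  # shard payload (stands in for a file on storage)


def save_sharded(global_tensor: List[float], num_ranks: int) -> List[ShardMeta]:
    """Split a flat tensor across num_ranks and record each shard's global offset."""
    base, extra = divmod(len(global_tensor), num_ranks)
    shards, start = [], 0
    for rank in range(num_ranks):
        length = base + (1 if rank < extra else 0)
        shards.append(ShardMeta(start, length, global_tensor[start:start + length]))
        start += length
    return shards


def load_resharded(shards: List[ShardMeta], total_len: int,
                   new_num_ranks: int, new_rank: int) -> List[float]:
    """Assemble the slice owned by new_rank from the overlapping saved shards."""
    base, extra = divmod(total_len, new_num_ranks)
    lo = new_rank * base + min(new_rank, extra)
    hi = lo + base + (1 if new_rank < extra else 0)
    out: List[float] = []
    for s in shards:  # shards are stored in increasing offset order
        a, b = max(lo, s.offset), min(hi, s.offset + s.length)
        if a < b:  # copy only the overlapping part of this shard
            out.extend(s.data[a - s.offset:b - s.offset])
    return out


# Save with 4 ranks, then load with 3 ranks without rewriting the checkpoint.
tensor = [float(i) for i in range(10)]
saved = save_sharded(tensor, num_ranks=4)
restored = [load_resharded(saved, len(tensor), 3, r) for r in range(3)]
```

Because the saved metadata refers only to global offsets, the loader can compute its own range under any new rank count and read just the overlapping pieces. Real systems must additionally handle multi-dimensional and irregularly sharded tensors, which this sketch omits.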

AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems

Mitigating I/O Impact of Checkpointing on Large Scale Parallel Systems

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

Efficient Incremental Checkpoint Based on Hybrid Page

Optimizing Checkpoint Restart with Data Deduplication

DCU-CHK: Checkpointing for Large-Scale CPU-DCU Heterogeneous Computing Systems

Parallel Compression Checkpointing for Socket-Level Heterogeneous Systems

Designing an Adaptive Application-Level Checkpoint Management System for Malleable MPI Applications

An I/O-efficient and Adaptive Fault-Tolerant Framework for Distributed Graph Computations

VELOC: VEry Low Overhead Checkpointing in the Age of Exascale

Hybrid Full/incremental Checkpoint/restart for MPI Jobs in HPC Environments

A Study on the Method of Adaptive Time Intervals Checkpointing

Understanding the Impact of BPRAM on Incremental Checkpoint

Self-Checkpoint

Improving Performance of Iterative Methods by Lossy Checkponting

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations With Machine Learning

Design and Implementation of a Low-Overhead File Checkpointing Approach

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

A Method of Self-Adaptive Pre-Copy Container Checkpoint

Optimizing Checkpoint for Scientific Simulations