ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, Chuan Wu
2024-10-10
Abstract: Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallelism configurations. In addition, saved checkpoints are dispatched to evaluation tasks or transferred across different training stages (e.g., from pre-training to post-training). All these scenarios require resharding distributed checkpoints from one parallelism to another. In production, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales. A high-performance checkpointing system is needed to enable efficient checkpoint management at scale. This paper presents ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint employs a parallelism-agnostic checkpoint representation that enables efficient load-time checkpoint resharding. ByteCheckpoint advocates a generic checkpoint saving/loading workflow to accommodate multiple training frameworks and support different storage backends. To ensure high I/O efficiency, we take a full-stack approach to optimize saving/loading plan generation, critical stages of checkpointing pipelines, and irregular tensor processing required by resharding. To guarantee the scalability of ByteCheckpoint in large-scale training, we enhance the storage system to efficiently handle high volumes of checkpointing I/O requests, devise communication optimizations within the checkpointing workflow, and introduce a suite of monitoring tools to analyze performance and detect bottlenecks. Compared to existing open-source checkpointing systems [40, 46], ByteCheckpoint significantly reduces runtime checkpoint stalls, achieving an average reduction of 54.20x. For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.
Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the challenges of checkpoint management in the development of large-scale foundation models (LFMs). Specifically, it proposes ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint mainly solves the following problems:

1. **Efficient checkpoint resharding**:
   - Across the stages and tasks of LFM training, checkpoints often need to be resharded because of changes in resource allocation (such as an increase or decrease in the number of GPUs), adjustments to the training configuration (such as changes in context length), or the application of system optimization techniques. Traditional offline resharding is time-consuming and costly to maintain. ByteCheckpoint performs resharding automatically at load time, which significantly reduces this overhead and improves the end-to-end effective training time ratio (ETTR).
2. **Support for multiple training frameworks and storage backends**:
   - Different users may choose different training frameworks (such as Megatron-LM, PyTorch FSDP, DDP, etc.) and storage backends (such as local disks, HDFS, NAS, etc.). ByteCheckpoint provides a generic workflow that adapts to these frameworks and backends, avoiding the complexity and maintenance cost of a custom implementation for each one.
3. **Efficient I/O performance and scalability**:
   - Large-scale LFM training involves a large number of I/O operations, especially when saving and loading distributed checkpoints. ByteCheckpoint ensures efficient I/O through full-stack optimizations, including balanced and zero-redundancy plan generation, fully asynchronous execution pipelines, and irregular tensor processing.
   - In addition, ByteCheckpoint optimizes the storage system to handle a large number of I/O requests and provides monitoring tools to analyze performance and detect bottlenecks, thereby ensuring the scalability of the system.

### Summary

ByteCheckpoint addresses the complexity and efficiency problems of checkpoint management in large-scale LFM training by providing a unified architecture, a decoupled checkpoint representation, a generic workflow, and full-stack optimizations. Compared with existing open-source checkpointing systems, ByteCheckpoint shows significant advantages in reducing checkpoint stall time and in speeding up saving and loading.
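To make the idea of load-time resharding concrete, here is a minimal Python sketch. It is not ByteCheckpoint's implementation or API. It assumes a simplified setting in which every tensor is flattened to one dimension and each saved shard records only its tensor name, offset, and length in the global tensor. The names `ShardIndex`, `SavedShard`, `split_evenly`, `save_sharded`, and `load_for_rank` are hypothetical. Because the metadata refers to global positions rather than to the saving job's parallelism, a job with a different number of ranks can read back exactly the slices it needs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShardIndex:
    """Position of one saved shard inside the flattened global tensor (hypothetical)."""
    name: str    # tensor name
    offset: int  # start index in the global tensor
    length: int  # number of elements in this shard


@dataclass(frozen=True)
class SavedShard:
    index: ShardIndex
    data: list


def split_evenly(total: int, parts: int) -> List[Tuple[int, int]]:
    """Return (offset, length) for `parts` contiguous ranges covering `total` elements."""
    base, extra = divmod(total, parts)
    ranges, offset = [], 0
    for rank in range(parts):
        length = base + (1 if rank < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def save_sharded(name: str, values: list, world_size: int) -> List[SavedShard]:
    """Simulate each of `world_size` ranks saving its own slice plus its metadata."""
    return [
        SavedShard(ShardIndex(name, off, ln), values[off:off + ln])
        for off, ln in split_evenly(len(values), world_size)
    ]


def load_for_rank(shards: List[SavedShard], name: str, total: int,
                  rank: int, world_size: int) -> list:
    """Assemble the slice owned by `rank` under a new world size from overlapping shards."""
    want_off, want_len = split_evenly(total, world_size)[rank]
    want_end = want_off + want_len
    out: list = [None] * want_len
    for shard in shards:
        if shard.index.name != name:
            continue
        # Intersect the wanted range with the range this saved shard covers.
        lo = max(want_off, shard.index.offset)
        hi = min(want_end, shard.index.offset + shard.index.length)
        if lo < hi:
            src = shard.data[lo - shard.index.offset:hi - shard.index.offset]
            out[lo - want_off:hi - want_off] = src
    return out


# Save with 4 ranks, then resume with 3 ranks without any offline conversion.
weights = list(range(10))
checkpoint = save_sharded("layer0.weight", weights, world_size=4)
resumed = [load_for_rank(checkpoint, "layer0.weight", len(weights), r, 3)
           for r in range(3)]
```

In this toy setup, each new rank reads only the byte ranges that intersect its own slice, so no separate offline resharding job is needed. A real system would have to handle multi-dimensional shards, irregular tensor layouts, and remote storage, none of which this sketch covers.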