Abstract: LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., component failures, software instability, undesirable learning patterns) are frequent and typically impact the training negatively. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system) incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads to enable fast and scalable checkpointing for LLMs that can be applied at high frequency (down to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48$\times$ faster checkpointing and 2.2$\times$ faster end-to-end training runtime compared with state-of-the-art checkpointing approaches.

What problem does this paper attempt to address?

The paper addresses the cost of checkpointing when training large language models (LLMs) in high-performance computing (HPC) environments. Specifically, it covers the following points:

1. **Importance of Checkpointing**: LLM training runs for a long time and involves many components, so unexpected events (hardware failures, software errors, etc.) are likely. Such events can harm the training, for example by corrupting the global state. Recovering from them requires frequent checkpoints that the training can roll back to.
2. **Limitations of Traditional Checkpointing Methods**: Traditional checkpointing solutions write model parameters and optimizer states directly to persistent storage. For large LLMs this incurs significant I/O overhead that stalls training, and the overhead grows with model size (e.g., billions or even trillions of parameters).
3. **Proposed Method**: The paper proposes DataStates-LLM, a lazy asynchronous multi-level checkpointing technique. It relies on the observation that the model parameters and optimizer states remain unchanged during the forward and backward passes of each training iteration. The model state can therefore be copied to host memory during these phases without blocking training, which reduces the time each iteration spends waiting for device-to-host I/O to complete.
4. **Key Technical Contributions**:
   - **Hybrid Flushing**: Asynchronously flushes model/optimizer fragments from the GPU to host memory.
   - **Lazy Copying**: Overlaps data transfer with the periods during training when model parameters and optimizer states remain unchanged.
   - **Multi-level Flushing**: Efficient data transfer strategy from host memory to persistent storage.
   - **Asynchronous Integration**: Asynchronous integration of model/optimizer fragments.
5. **Evaluation and Results**: The paper evaluates the method at scales of up to 180 GPUs with different model sizes, parallelism settings, and checkpointing frequencies. Compared with state-of-the-art checkpointing methods, DataStates-LLM checkpoints up to 48 times faster and speeds up end-to-end training by up to 2.2 times.

In summary, the paper targets checkpointing efficiency in large-scale LLM training with a lazy asynchronous multi-level checkpointing mechanism that makes checkpoints fast and scalable enough to be taken at high frequency, which in turn shortens overall training time.
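The lazy copying and multi-level flushing ideas summarized above can be illustrated with a small, CPU-only sketch. This is not the DataStates-LLM implementation: the class `LazyAsyncCheckpointer`, the function `train`, the toy momentum update, and the use of `copy.deepcopy` and `pickle` as stand-ins for device-to-host copies and parallel file system writes are all assumptions made for illustration. The one property it tries to show is that the snapshot overlaps with the read-only forward/backward phase, and training blocks only right before the state is mutated.

```python
import copy
import pickle
import threading
from pathlib import Path


class LazyAsyncCheckpointer:
    """Hypothetical two-level checkpointer: live state -> host copy -> file."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self._snapshot_done = threading.Event()
        self._snapshot_done.set()  # no snapshot in flight yet
        self._worker = None

    def start(self, state, iteration):
        """Launch a background checkpoint and return immediately."""
        self.wait_for_flush()  # keep at most one checkpoint in flight
        self._snapshot_done.clear()
        self._worker = threading.Thread(target=self._run, args=(state, iteration))
        self._worker.start()

    def _run(self, state, iteration):
        # Level 1: copy the state while the trainer only reads it
        # (stand-in for a GPU-to-host-memory copy).
        host_copy = copy.deepcopy(state)
        self._snapshot_done.set()
        # Level 2: flush the private host copy to persistent storage.
        # The trainer is free to mutate the live state from here on.
        with open(self.directory / f"ckpt_{iteration}.pkl", "wb") as f:
            pickle.dump(host_copy, f)

    def wait_for_snapshot(self):
        """Block until the host copy exists; call before mutating the state."""
        self._snapshot_done.wait()

    def wait_for_flush(self):
        """Block until the previous checkpoint is fully written."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None


def train(num_iterations, directory):
    """Toy loop; ckpt_<k>.pkl holds the state after k updates."""
    state = {"weights": [0.0] * 4, "momentum": [0.0] * 4}
    ckpt = LazyAsyncCheckpointer(directory)
    for it in range(num_iterations):
        ckpt.start(state, it)
        # "Forward/backward": reads the state but never writes it,
        # so the background copy can proceed concurrently.
        grads = [w - 1.0 for w in state["weights"]]
        # The update mutates the state in place, so wait for the copy first.
        ckpt.wait_for_snapshot()
        for i, g in enumerate(grads):
            state["momentum"][i] = 0.9 * state["momentum"][i] + g
            state["weights"][i] -= 0.1 * state["momentum"][i]
    ckpt.wait_for_flush()
    return state
```

Calling `train(3, some_directory)` writes `ckpt_0.pkl` through `ckpt_2.pkl` into an existing directory. In this sketch the file write is the slow step that keeps running after `wait_for_snapshot()` returns, which is why only the in-memory copy, and not the storage I/O, sits on the critical path of the update.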

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
