Abstract: Transformers and large language models (LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a "memory wall", i.e., even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to the host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlapping between data movements and computations. This leads to missed opportunities to simultaneously leverage the interconnect bandwidth and computational capabilities of CPUs and GPUs. In this paper, we leverage a key observation that the interleaving of the forward, backward and update phases generates fluctuations in the GPU memory utilization, which can be exploited to dynamically move a part of the optimizer state between the host and the GPU memory at each iteration. To this end, we design and implement Deep Optimizer States, a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU based on our proposed performance model that addresses the trade-off between data movement cost, acceleration on the GPUs vs the CPUs, and competition for shared resources. We integrate our approach with DeepSpeed and demonstrate 2.5× faster iterations over state-of-the-art approaches using extensive experiments.
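To make the "memory wall" concrete, here is a back-of-the-envelope estimate of the training state the abstract lists. It is only a sketch: the byte counts assume mixed-precision Adam with the commonly cited 16 bytes per parameter (2 for FP16 parameters, 2 for FP16 gradients, 12 for FP32 master parameters, momentum, and variance). These figures are an assumption and are not taken from the paper, activations are ignored, and `training_state_gb` is a hypothetical helper name.

```python
def training_state_gb(num_params, param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate training-state size in GB (1e9 bytes), excluding activations.

    The default byte counts are an assumption (mixed-precision Adam),
    not values reported in the paper.
    """
    gb = 1e9
    sizes = {
        "parameters_gb": num_params * param_bytes / gb,
        "gradients_gb": num_params * grad_bytes / gb,
        "optimizer_gb": num_params * optim_bytes / gb,
    }
    sizes["total_gb"] = sum(sizes.values())
    return sizes


if __name__ == "__main__":
    # A 20B-parameter model, the largest size mentioned in the summary.
    for name, value in training_state_gb(20_000_000_000).items():
        print(f"{name}: {value:.0f}")
```

Under these assumptions the optimizer state accounts for three quarters of the total, which is why it is the usual candidate for offloading to host memory.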

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper "Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading" aims to address the "memory wall" issue in the training of large language models (LLMs). Specifically, as the number of parameters in Transformer models and LLMs increases dramatically, even with 3D parallelism techniques (pipeline, tensor, data parallelism) and the aggregation of memory across multiple GPUs, it is still impossible to fit all necessary data structures (model parameters, optimizer states, gradients, activations) into GPU memory. Existing solutions typically offload part or all of the optimizer states to host memory and perform hybrid CPU-GPU computation, but they manage the combined host-GPU memory suboptimally. The result is poor overlap between data transfer and computation, which misses the opportunity to use the interconnect bandwidth and the computational capabilities of CPUs and GPUs at the same time.

### Solution

To address the above issues, the authors propose a new technique called **Deep Optimizer States**. The core idea is to exploit the fluctuations in GPU memory utilization that arise from the interleaving of the forward, backward, and update phases, and to dynamically move parts of the optimizer states between host memory and GPU memory in each iteration. Specific contributions include:

1. **Detailed study of training iteration behavior when offloading optimizer states**: The study finds that computation remains efficient even when large optimizer states are subdivided into subgroups, that GPU memory utilization drops significantly during the update phase, and that PCIe link utilization is low during the backward and update phases.
2. **Introduction of a series of key design principles**: Including interleaved offloading of parameter updates to the GPU, overlapping movement and execution of optimizer subgroups between GPU and CPU, efficient gradient placement and movement, and high-precision PCIe transfers to avoid costly on-the-fly precision conversions.
3. **Proposal of a new performance model**: To determine the GPU offloading frequency that maximizes overlap with CPU computation, and the development of an algorithm to perform interleaved CPU-GPU offloading.
4. **Design and implementation of Deep Optimizer States**: Integrating it into widely used LLM training runtimes (such as DeepSpeed and Megatron), emphasizing background parallelism and interaction with other existing components to accelerate hybrid CPU-GPU training.
5. **Extensive experimental evaluation of the implementation**: Demonstrating significant acceleration in end-to-end training time when training LLMs with up to 20B parameters in resource-constrained settings, and up to 3x speedup in model parameter update speed under various configurations.

### Main Innovations

- **Dynamic Memory Management**: Leveraging fluctuations in GPU memory utilization to dynamically move parts of the optimizer states between host and GPU memory, thereby using memory resources more efficiently.
- **Interleaved Offloading**: Maximizing the overlap of data transfer and computation through interleaved CPU-GPU offloading, reducing bottlenecks.
- **Performance Model**: Proposing a performance model to determine the optimal GPU offloading frequency to maximize computational efficiency.

### Limitations

- **Dependence on Model and Optimizer Sharding**: Requires sharding the model and optimizer into smaller subgroups, which may not be available in some frameworks.
- **Data Transfer and CPU Computation Speed Limitations**: Although part of the update is accelerated, the approach is still limited by PCIe data transfer speed and CPU computation speed.

In summary, this paper aims to overcome the memory and computation bottlenecks in existing methods by proposing the Deep Optimizer States technique, achieving more efficient LLM training.
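The trade-off behind the performance model and the interleaved offloading can be illustrated with a toy calculation. The sketch below is not the authors' model. It assumes that every subgroup has the same size, that the CPU and GPU update paths run fully in parallel, that PCIe transfers in both directions overlap with GPU computation, and that 12 bytes of optimizer state per parameter cross the link in each direction. All names (`Hardware`, `plan_update`, `interleave`) and all throughput numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    cpu_params_per_s: float          # CPU optimizer-update throughput (assumed)
    gpu_params_per_s: float          # GPU optimizer-update throughput (assumed)
    pcie_bytes_per_s: float          # PCIe bandwidth per direction (assumed)
    state_bytes_per_param: int = 12  # FP32 params + momentum + variance (assumed)


def subgroup_times(params_per_subgroup, hw):
    """Return (CPU time, GPU-path time) in seconds for one subgroup."""
    t_cpu = params_per_subgroup / hw.cpu_params_per_s
    t_transfer = hw.state_bytes_per_param * params_per_subgroup / hw.pcie_bytes_per_s
    t_compute = params_per_subgroup / hw.gpu_params_per_s
    # With transfers overlapped, the slower of the two stages bounds the GPU path.
    return t_cpu, max(t_transfer, t_compute)


def plan_update(num_subgroups, params_per_subgroup, hw):
    """Pick how many subgroups to update on the GPU to minimize the makespan."""
    t_cpu, t_gpu = subgroup_times(params_per_subgroup, hw)

    def makespan(k_gpu):
        return max((num_subgroups - k_gpu) * t_cpu, k_gpu * t_gpu)

    best_k = min(range(num_subgroups + 1), key=makespan)
    return best_k, makespan(best_k)


def interleave(num_subgroups, k_gpu):
    """Spread k_gpu GPU-updated subgroups evenly among the CPU-updated ones."""
    return [
        "gpu" if ((i + 1) * k_gpu) // num_subgroups > (i * k_gpu) // num_subgroups
        else "cpu"
        for i in range(num_subgroups)
    ]


if __name__ == "__main__":
    hw = Hardware(cpu_params_per_s=1e9, gpu_params_per_s=1e10, pcie_bytes_per_s=24e9)
    k_gpu, seconds = plan_update(12, 100_000_000, hw)
    print(k_gpu, round(seconds, 3), interleave(12, k_gpu))
```

In this toy setting the GPU path is bound by PCIe transfer time rather than by GPU compute, which mirrors the limitation noted above: the benefit of moving updates to the GPU is capped by the interconnect and by how fast the CPU processes the remaining subgroups.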

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Improving Automatic Parallel Training Via Balanced Memory Workload Optimization

An Efficient 2D Method for Training Super-Large Deep Learning Models

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Multi-level Storage Optimization for Intermediate Data in AI Model Training

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Galvatron

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training

Efficient Large Models Fine-tuning on Commodity Servers Via Memory-balanced Pipeline Parallelism

AccTFM: an Effective Intra-Layer Model Parallelization Strategy for Training Large-Scale Transformer-Based Models.

Accelerating Large Language Model Training with Hybrid GPU-based Compression

PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training

Optimizing Layer-Fused Scheduling of Transformer Networks on Multi-accelerator Platforms

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

A Swap Dominated Tensor Re-Generation Strategy for Training Deep Learning Models

Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs

An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs