Abstract: Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper aims to reduce the high memory requirements of training large language models (LLMs), especially when using the AdamW optimizer. Although AdamW performs excellently in terms of stability and performance, it must maintain first- and second-moment estimates for every parameter, so the weights plus the optimizer state take almost three times the memory of the model parameters alone. For large-scale models, such as LLMs with billions of parameters, this overhead becomes a memory bottleneck during training.

#### Main problems include:

1. **Excessive memory consumption**:
   - With AdamW, memory consumption is large even for a single batch of data. For example, training a 7-billion-parameter LLaMA model requires at least 58GB of memory, of which 28GB is used for the optimizer state of AdamW.
   - For larger models, such as GPT-3 (175 billion parameters), the model itself requires 700GB of memory, and the optimizer state of AdamW requires as much as 1.4TB.
2. **Limited training scalability**:
   - Memory limitations force researchers either to use more or higher-end GPUs, which limits scalability, or to reduce the batch size, which limits throughput.
   - Expanding the training cluster introduces additional communication and infrastructure overhead, and a smaller batch size lowers training throughput.
3. **Limitations of existing methods**:
   - Existing memory-efficient optimizers (such as GaLore and Fira) rely on expensive SVD operations or make significant compromises in performance, and they do not fully resolve the trade-off between memory and performance.

#### Solutions:

To solve the above problems, the authors propose a new optimizer, APOLLO (Approximated Gradient Scaling for Memory-Efficient LLM Optimization). APOLLO achieves memory-efficient optimization with strong performance in the following ways:

1. **Structured learning rate update**:
   - Redesign the learning rate update rule of AdamW, coarsening it from an element-wise update to a channel-level or tensor-level update. This reduces both the computational overhead and the memory requirements.
2. **Approximate gradient scaling in a low-rank auxiliary space**:
   - Use random projection to map the gradient to a low-rank space and keep the optimizer state only there, which significantly reduces memory consumption. In this way, APOLLO can reduce memory usage to a level close to that of SGD while maintaining performance.
3. **Extremely memory-efficient variant APOLLO-Mini**:
   - Further compress the optimizer state by using a rank-1 space with tensor-level gradient scaling, achieving SGD-level memory cost while outperforming AdamW.

Through these innovations, APOLLO can significantly reduce memory usage and also improve training throughput and scalability, especially when resources are limited, making the training of large-scale language models more feasible.
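The three ideas above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `apollo_like_step`, the `state` dictionary layout, the hyperparameter defaults, the choice to project the row dimension, and the choice to regenerate a fixed projection from a stored seed are all assumptions made for illustration. The sketch keeps Adam-style moments only for a rank-`r` random projection of the gradient, then turns them into one scaling factor per column (channel-level) or one per tensor (the APOLLO-Mini style setting), which is applied to the full-rank gradient.

```python
import numpy as np

def apollo_like_step(W, G, state, lr=1e-3, rank=4, beta1=0.9, beta2=0.999,
                     eps=1e-8, tensor_wise=False):
    """One illustrative update with Adam moments kept only in a rank-r space.

    W, G: weight and gradient matrices of shape (m, n).
    state: dict carrying the low-rank moments between calls (hypothetical layout).
    """
    m, n = W.shape
    # Pure random projection (no SVD). Only the seed is stored, and the same
    # Gaussian matrix is regenerated on every call (an assumption of this sketch).
    seed = state.setdefault("seed", 0)
    P = np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(rank), size=(rank, m))
    R = P @ G  # projected gradient, shape (rank, n)

    if "M" not in state:
        state["M"] = np.zeros_like(R)  # first moment in the low-rank space
        state["V"] = np.zeros_like(R)  # second moment in the low-rank space
        state["t"] = 0
    state["t"] += 1
    state["M"] = beta1 * state["M"] + (1 - beta1) * R
    state["V"] = beta2 * state["V"] + (1 - beta2) * R ** 2

    # Standard Adam bias correction, applied to the low-rank moments.
    m_hat = state["M"] / (1 - beta1 ** state["t"])
    v_hat = state["V"] / (1 - beta2 ** state["t"])
    R_tilde = m_hat / (np.sqrt(v_hat) + eps)  # Adam-style update of R

    if tensor_wise:
        # One scaling factor for the whole tensor.
        scale = np.linalg.norm(R_tilde) / (np.linalg.norm(R) + eps)
    else:
        # One scaling factor per column (channel), shape (n,).
        scale = np.linalg.norm(R_tilde, axis=0) / (np.linalg.norm(R, axis=0) + eps)

    # Scale the full-rank gradient; the (n,) vector broadcasts across rows.
    return W - lr * G * scale
```

In this sketch the persistent state holds `2 * rank * n` floats per weight matrix, compared with `2 * m * n` for element-wise AdamW moments, so with `rank=1` and `tensor_wise=True` the state shrinks to two length-`n` vectors. Weight decay and any extra scaling of the update are left out to keep the example short.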

APOLLO: SGD-like Memory, AdamW-level Performance

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training

Adam-mini: Use Fewer Learning Rates To Gain More

Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models

CAME: Confidence-guided Adaptive Memory Efficient Optimization

Efficient Adaptive Optimization via Subset-Norm and Subspace-Momentum: Fast, Memory-Reduced Training with Convergence Guarantees

Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator

Mini-batch Coresets for Memory-efficient Training of Large Language Models

Memorize Step by Step: Efficient Long-Context Prefilling with Incremental Memory and Decremental Chunk

Dynamic Memory Based Adaptive Optimization

LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks

ProTrain: Efficient LLM Training via Memory-Aware Techniques

FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training