GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian

2024-06-03

Abstract: Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages, since they limit the parameter search to a low-rank subspace and alter the training dynamics, and they may further require a full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with the C4 dataset with up to 19.7B tokens, and for fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies.
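To make the optimizer-state argument concrete, the sketch below counts stored optimizer elements for a single m x n weight matrix. The layout it assumes for GaLore (one m x r projector plus two r x n Adam moment buffers) and the example sizes are illustrative assumptions, not figures from the paper. The resulting per-matrix ratio is not meant to reproduce the whole-model percentages quoted in the abstract.

```python
def adam_state_elems(m, n):
    """Full-rank Adam keeps a first and a second moment, each of shape m x n."""
    return 2 * m * n


def galore_state_elems(m, n, r):
    """Assumed low-rank layout: projector (m x r) plus two moments of shape r x n."""
    return m * r + 2 * r * n


# Hypothetical layer size and rank, chosen only for illustration.
m, n, r = 4096, 4096, 128

full = adam_state_elems(m, n)
low = galore_state_elems(m, n, r)
saving = 1 - low / full

print(f"full-rank Adam states: {full} elements")
print(f"projected states:      {low} elements")
print(f"per-matrix saving:     {saving:.1%}")
```

The saving grows as the rank r shrinks relative to m and n. Whole-model numbers are lower than this per-matrix ratio because weights, activations, and any layers left unprojected still take up memory.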

Machine Learning

What problem does this paper attempt to address?

This paper addresses the memory cost of training large language models (LLMs), which is dominated by the growing size of weights and optimizer states. Existing memory-reduction methods such as Low-Rank Adaptation (LoRA) add trainable low-rank matrices to the frozen pre-trained weights in each layer. These methods typically underperform full-rank training in both pre-training and fine-tuning, because they restrict the parameter search to a low-rank subspace and alter the training dynamics, and they may require a full-rank warm start. The paper proposes Gradient Low-Rank Projection (GaLore), a training strategy that keeps full-parameter learning while using less memory than common low-rank adaptation methods. Rather than constraining the weights, GaLore exploits the slowly varying low-rank structure of the weight-matrix gradients: the gradient is projected into a low-rank subspace, and the optimizer states are kept only in that subspace. According to the paper, this reduces optimizer-state memory by up to 65.5% while maintaining efficiency and performance when pre-training LLaMA 1B and 7B architectures on C4 and when fine-tuning RoBERTa on GLUE tasks. The 8-bit variant reduces optimizer memory by up to 82.5% and total training memory by 63.3% relative to a BF16 baseline. The paper also demonstrates, for the first time, that a 7B model can be pre-trained on a consumer-grade GPU with 24GB of memory (such as an NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading. When fine-tuning pre-trained LLMs on the GLUE tasks, GaLore achieves results comparable to or better than existing low-rank methods.
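The projection idea summarized above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation. The function names, the `update_gap` refresh interval, the choice of the projector as the top left singular vectors of the current gradient, and the toy quadratic loss are all assumptions made for the example. It also leaves out details a real optimizer would need, such as weight decay, update scaling, and handling matrices of either orientation.

```python
import numpy as np


def top_left_singular_vectors(grad, rank):
    """Orthonormal (m x rank) basis for the dominant column space of grad."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def projected_adam_step(w, grad, state, rank=4, update_gap=50, lr=0.1,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step whose moment buffers live in a rank-`rank` subspace."""
    # Refresh the projector only occasionally, relying on the gradient
    # subspace changing slowly between refreshes.
    if state["t"] % update_gap == 0:
        state["P"] = top_left_singular_vectors(grad, rank)
    p = state["P"]

    low_grad = p.T @ grad  # shape (rank x n) instead of (m x n)
    if state["m"] is None:
        state["m"] = np.zeros_like(low_grad)
        state["v"] = np.zeros_like(low_grad)

    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * low_grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * low_grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])

    # Project the low-rank update back so the full weight matrix is updated.
    return w - lr * (p @ (m_hat / (np.sqrt(v_hat) + eps)))


# Toy problem (hypothetical): fit a 16 x 32 matrix under 0.5 * ||W - target||^2.
rng = np.random.default_rng(0)
target = rng.standard_normal((16, 32))
w = np.zeros_like(target)
state = {"t": 0, "P": None, "m": None, "v": None}


def loss(weights):
    return 0.5 * np.sum((weights - target) ** 2)


initial_loss = loss(w)
for _ in range(200):
    grad = w - target  # gradient of the toy loss
    w = projected_adam_step(w, grad, state)
final_loss = loss(w)

print("moment buffer shape:", state["m"].shape)
print("loss before / after:", initial_loss, final_loss)
print("rank of learned W:", np.linalg.matrix_rank(w))
```

The moment buffers have shape rank x n instead of m x n, which is where the optimizer-state saving comes from. Because the projector is refreshed periodically, successive updates land in different subspaces, so the learned weight matrix is not confined to a single low-rank subspace the way a fixed low-rank adapter is.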

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks

Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

Full Parameter Fine-tuning for Large Language Models with Limited Resources

From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

ProTrain: Efficient LLM Training via Memory-Aware Techniques

COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection

Mini-batch Coresets for Memory-efficient Training of Large Language Models

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections

LoRA: Low-Rank Adaptation of Large Language Models

LoRA-GA: Low-Rank Adaptation with Gradient Approximation

Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning

Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs