Abstract: Training Large Language Models (LLMs) is memory-intensive due to the large number of parameters and associated optimization states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent subspace updates lead to significant training time overhead. Moreover, GaLore offers minimal improvements in accuracy and efficiency compared to LoRA in more accessible fine-tuning scenarios. To address these limitations, we introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection, surpassing the benefits of GaLore. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others are subject to frequent changes; (ii) the projection matrices are highly resilient to low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on its convergence statistics, achieving comparable performance while significantly reducing the number of SVD operations. We maintain the projection matrices in INT4 format and the weights in INT8 format, incorporating stochastic rounding to capture accumulated gradient information. This approach enables a high-precision training trajectory using only low-precision weights. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency. In pre-training, Q-GaLore facilitates training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory. In fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost.
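The role of stochastic rounding in keeping a high-precision trajectory with low-precision weights can be illustrated with a toy scalar example. The sketch below is not Q-GaLore's implementation: it assumes a single weight stored on an integer grid with unit spacing (real INT8 weights also carry scale factors), and the names `stochastic_round` and `simulate` are hypothetical. It only shows why rounding to nearest discards updates smaller than half a grid step, while stochastic rounding preserves them in expectation.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x down or up at random so that the expected result equals x."""
    lower = math.floor(x)
    frac = x - lower  # probability of rounding up
    return lower + (1 if rng.random() < frac else 0)


def simulate(update, steps, seed=0):
    """Apply the same small update `steps` times to two integer-valued weights.

    One weight is re-quantized with round-to-nearest, the other with
    stochastic rounding. Returns (w_nearest, w_stochastic).
    """
    rng = random.Random(seed)
    w_nearest = 0
    w_stochastic = 0
    for _ in range(steps):
        # Round-to-nearest: an update below half a grid step is always lost.
        w_nearest = round(w_nearest + update)
        # Stochastic rounding: the update survives with probability `frac`.
        w_stochastic = stochastic_round(w_stochastic + update, rng)
    return w_nearest, w_stochastic


if __name__ == "__main__":
    nearest, stochastic = simulate(update=0.1, steps=1000)
    print("round-to-nearest:", nearest)      # stays at 0
    print("stochastic rounding:", stochastic)  # close to 0.1 * 1000 on average
```

With an update of 0.1 per step, the round-to-nearest weight never leaves 0, whereas the stochastically rounded weight drifts toward the value a full-precision weight would reach, which is the sense in which small gradient contributions are accumulated rather than dropped.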

BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

LoRA-Mini : Adaptation Matrices Decomposition and Selective Training

Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

Full Parameter Fine-tuning for Large Language Models with Limited Resources

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

Mini-batch Coresets for Memory-efficient Training of Large Language Models

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Memory-Efficient LLM Training with Online Subspace Descent

GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning

LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Exploring Gradient Subspaces: Addressing and Overcoming LoRA's Limitations in Federated Fine-Tuning of Large Language Models

ProTrain: Efficient LLM Training via Memory-Aware Techniques