Abstract: Training Large Language Models (LLMs) is memory-intensive due to the large number of parameters and associated optimization states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent subspace updates lead to significant training time overhead. Moreover, GaLore offers minimal improvements in accuracy and efficiency compared to LoRA in more accessible fine-tuning scenarios. To address these limitations, we introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection, surpassing the benefits of GaLore. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others are subject to frequent changes; (ii) the projection matrices are highly resilient to low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on its convergence statistics, achieving comparable performance while significantly reducing the number of SVD operations. We maintain the projection matrices in INT4 format and weights in INT8 format, incorporating stochastic rounding to capture accumulated gradient information. This approach enables a high-precision training trajectory using only low-precision weights. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency. At pre-training, Q-GaLore facilitates training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB memory. At fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost.

What problem does this paper attempt to address?

This paper targets the memory cost of training large language models (LLMs), which is driven by the large number of parameters and their associated optimization states. Existing methods such as GaLore reduce memory usage through low-rank gradient projection, but GaLore relies on time-consuming singular value decomposition (SVD) operations and frequent subspace updates, which add significant training time overhead. In addition, GaLore offers only limited accuracy and efficiency gains over LoRA in the more accessible fine-tuning scenarios.

To address these issues, the paper proposes Q-GaLore, which combines quantization with low-rank projection to reduce memory usage beyond what GaLore achieves. Q-GaLore is based on two key observations: (i) gradient subspaces converge differently across layers, with some layers converging early in training while others keep changing throughout; (ii) projection matrices are highly resilient to low-bit quantization and can be quantized to 4 bits without sacrificing training quality. Using these insights, Q-GaLore updates the gradient subspace adaptively based on its convergence statistics, which reduces the number of SVD operations while maintaining comparable performance. It keeps weights in INT8 format and projection matrices in INT4 format, and uses stochastic rounding to capture accumulated gradient information, so that a high-precision training trajectory is possible with only low-precision weights.

In pre-training, Q-GaLore can train a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti GPU with only 16 GB of memory. In fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost on tasks such as MMLU.

In summary, Q-GaLore addresses the memory cost of training large language models through quantization and adaptive subspace updates. It cuts the SVD overhead that slows GaLore and lowers resource requirements, which makes training feasible on a wider range of hardware configurations.
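The abstract names stochastic rounding as the way low-precision weights capture accumulated gradient information. The NumPy sketch below illustrates that general idea only; it is not the paper's implementation. The function names, the single per-tensor `scale`, and the toy constant update are all assumptions made for illustration. It compares round-to-nearest with stochastic rounding when each update is smaller than one INT8 quantization step.

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each element up with probability equal to its fractional part."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)

def quantize_int8(w, scale, rng=None):
    """Map float weights to INT8 codes (hypothetical per-tensor scale).

    Uses stochastic rounding when rng is given, round-to-nearest otherwise.
    """
    scaled = w / scale
    rounded = stochastic_round(scaled, rng) if rng is not None else np.rint(scaled)
    return np.clip(rounded, -128, 127).astype(np.int8)

def simulate(steps, update, scale, rng=None):
    """Apply the same small update `steps` times to weights stored only in INT8.

    Returns the mean dequantized weight after the last step.
    """
    q = np.zeros(10_000, dtype=np.int8)
    for _ in range(steps):
        w = q.astype(np.float32) * scale + update  # dequantize, then update
        q = quantize_int8(w, scale, rng)           # store back in INT8
    return float((q.astype(np.float32) * scale).mean())

# Each update (0.001) is one tenth of a quantization step (0.01).
# The exact accumulated value after 100 steps would be 0.1.
nearest = simulate(100, 0.001, 0.01)
stochastic = simulate(100, 0.001, 0.01, np.random.default_rng(0))
print(nearest, stochastic)
```

With round-to-nearest, every sub-step update is rounded away and the weights never leave zero. With stochastic rounding, the rounding is unbiased, so the average weight tracks the accumulated update closely even though no high-precision copy of the weights is kept.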

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks

Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning

Subspace Optimization for Large Language Models with Convergence Guarantees

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

Low-Rank Quantization-Aware Training for LLMs

QLoRA: Efficient Finetuning of Quantized LLMs

From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection

LQER: Low-Rank Quantization Error Reconstruction for LLMs

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Full Parameter Fine-tuning for Large Language Models with Limited Resources

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

Exploring Gradient Subspaces: Addressing and Overcoming LoRA's Limitations in Federated Fine-Tuning of Large Language Models

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

QuAILoRA: Quantization-Aware Initialization for LoRA

GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning