Abstract: Modern Large Language Models (LLMs) are composed of matrices with billions of elements, making their storage and processing demanding in terms of computational resources and memory usage. Because they are so large, such matrices can often be expressed in low-rank format, which has the potential to relax these resource requirements. Unlike prior work that focuses on developing novel matrix decomposition algorithms, in this work we first study the emergence of low-rank structures across matrices within different layers of LLMs and establish a consequential relationship between the gradient dynamics and the emerging low-rank expressiveness of matrices. Our findings reveal that different layers exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize the performance drop due to compression. In view of that, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. WeLore capitalizes on the heavy-tailed distribution of singular values to identify a suitable rank reduction ratio for matrices within LLMs. Going beyond its use as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient perspective and extensive experiments illustrate that LRCs tend to have better finetuning capabilities and can closely mimic (and sometimes outperform) the training loss trajectory and performance of full finetuning with a notable reduction in memory and compute footprint. For example, finetuning a 50\% compressed LLaMa-2 7B model using only a fraction of parameters in LRCs (WeLore) can outperform its full finetuning with ~3x better throughput and ~0.6x GPU requirement.
Our code is available at \url{https://github.com/VITA-Group/welore}
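
The idea of choosing a per-matrix rank from the singular value spectrum, and then sorting matrices into LRCs and N-LRCs, can be illustrated with a small NumPy sketch. This is not the WeLore implementation from the repository above. The normalized-singular-value threshold rule, the parameter-count test for what counts as an LRC, the function names, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np


def rank_for_threshold(weight, threshold):
    """Count singular values whose magnitude, normalized by the largest
    singular value, is at least `threshold` (an assumed selection rule)."""
    s = np.linalg.svd(weight, compute_uv=False)
    normalized = s / s.max()
    return max(1, int((normalized >= threshold).sum()))


def low_rank_factors(weight, rank):
    """Return factors (a, b) with a @ b equal to the best rank-`rank`
    approximation of `weight` in the Frobenius norm (truncated SVD)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale the kept left singular vectors
    b = vt[:rank, :]
    return a, b


def is_low_rank_component(weight, rank):
    """Assumed criterion: treat the matrix as an LRC only if storing the
    two factors takes fewer parameters than storing the dense matrix."""
    m, n = weight.shape
    return rank * (m + n) < m * n


rng = np.random.default_rng(0)

# Toy "layer" with a sharply decaying spectrum: rank 4 plus small noise.
low = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
low += 1e-3 * rng.standard_normal((64, 48))

# Toy "layer" with a flat spectrum: an unstructured Gaussian matrix.
dense = rng.standard_normal((64, 48))

threshold = 0.1  # hypothetical value, shared by all matrices
for name, w in [("low", low), ("dense", dense)]:
    r = rank_for_threshold(w, threshold)
    label = "LRC" if is_low_rank_component(w, r) else "N-LRC"
    print(f"{name}: rank {r} of {min(w.shape)} -> {label}")

# Relative error of replacing the low-rank-like matrix by its factors.
r_low = rank_for_threshold(low, threshold)
a, b = low_rank_factors(low, r_low)
rel_err = np.linalg.norm(low - a @ b) / np.linalg.norm(low)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Applying one shared threshold to every matrix yields different ranks for different matrices, which is one simple way to obtain the non-uniform rank reduction the abstract argues for. Under this sketch, only matrices labeled LRC would be stored as factors and finetuned, while N-LRCs would stay dense.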

GaLore$+$: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

Subspace Optimization for Large Language Models with Convergence Guarantees

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks

Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning

Gradient Weight-normalized Low-rank Projection for Efficient LLM Training

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

Memory-Efficient LLM Training with Online Subspace Descent

Gaussian Stochastic Weight Averaging for Bayesian Low-Rank Adaptation of Large Language Models

LoRA ensembles for large language model fine-tuning

BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models