Abstract: Modern Large Language Models (LLMs) are composed of matrices with billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Being significantly large, such matrices can often be expressed in low-rank format, with the potential to relax resource requirements. Unlike prior works, which focus on developing novel matrix decomposition algorithms, in this work we first study the emergence of low-rank structures across matrices within different layers of LLMs and establish a consequential relationship between the gradient dynamics and the emerging low-rank expressiveness of matrices. Our findings reveal that different layers exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize the performance drop due to compression. In view of that, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. WeLore capitalizes on the heavy-tail distribution of singular values to identify a suitable rank reduction ratio for matrices within LLMs. Going beyond a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient perspective and extensive experiments illustrate that LRCs tend to have better finetuning capabilities and can closely mimic (sometimes outperform) the training loss trajectory and performance of full finetuning with a notable reduction in memory and compute footprint. For example, finetuning a 50% compressed LLaMa-2 7B model using only a fraction of parameters in LRCs (WeLore) can outperform its full finetuning with ~3x better throughput and ~0.6x GPU requirement.
Our code is available at https://github.com/VITA-Group/welore

What problem does this paper attempt to address?

This paper studies the low-rank structure of weight matrices in large language models (LLMs) and how to exploit it for both model compression and fine-tuning. The study finds that weight matrices in different layers converge to different degrees of low-rank structure during pre-training, and that this structure is tied to gradient dynamics. The paper proposes Weight Low-Rank Projection (WeLore), which uses the heavy-tailed singular value distribution of each weight matrix to choose its rank reduction ratio, and which divides the matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) according to how well they can be expressed in low-rank form. LRCs can be compressed efficiently and retain good performance during fine-tuning, while N-LRCs receive little or no compression. WeLore is therefore not only a compression technique: it also enables memory-efficient fine-tuning by backpropagating only through the LRCs, which reduces the number of trainable parameters, raises throughput, and lowers GPU memory requirements. Experiments show that fine-tuning only the LRCs can closely match, and sometimes outperform, full fine-tuning. For example, on a 50% compressed LLaMa-2 7B model, WeLore outperforms full fine-tuning while training only about 35% of the parameters, with approximately 3x better throughput and approximately 0.6x the GPU requirement.
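The rank-selection idea summarized above can be illustrated with a small NumPy sketch. It is an illustration under stated assumptions, not the paper's procedure: the relative threshold on normalized singular values, the parameter-count test used to label a matrix as an LRC, and every function name here (`choose_rank`, `low_rank_factors`, `is_low_rank_component`, `matrix_with_spectrum`) are hypothetical choices made for this example.

```python
import numpy as np


def choose_rank(weight, threshold):
    """Count singular values whose size relative to the largest exceeds `threshold`.

    Assumed criterion for this sketch; assumes `weight` is a nonzero matrix.
    """
    s = np.linalg.svd(weight, compute_uv=False)  # returned in descending order
    return max(1, int(np.sum(s / s[0] > threshold)))


def low_rank_factors(weight, rank):
    """Return A (m x rank) and B (rank x n); A @ B is the truncated-SVD approximation."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]


def is_low_rank_component(shape, rank):
    """Assumed labelling rule: LRC if the two factors hold fewer numbers than the matrix."""
    m, n = shape
    return rank * (m + n) < m * n


def matrix_with_spectrum(singular_values, rng):
    """Build a square test matrix with the given singular values (synthetic data only)."""
    n = len(singular_values)
    q1, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix
    q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return (q1 * singular_values) @ q2.T  # q1 @ diag(s) @ q2.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # One matrix with fast-decaying (heavy-tailed) singular values, one with a flat spectrum.
    heavy = matrix_with_spectrum(1.0 / (1.0 + np.arange(64)) ** 2, rng)
    flat = matrix_with_spectrum(np.linspace(1.0, 0.5, 64), rng)
    for name, w in [("heavy-tailed", heavy), ("flat", flat)]:
        r = choose_rank(w, threshold=0.05)
        a, b = low_rank_factors(w, r)
        rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
        print(name, "rank:", r, "LRC:", is_low_rank_component(w.shape, r), "rel. error:", rel_err)
```

The two synthetic matrices stand in for the contrast the summary draws: a matrix whose singular values decay quickly keeps only a few of them under the threshold and is cheaper to store as two factors, while a matrix with a flat spectrum keeps nearly all of them and gains nothing from factorization. Using one shared threshold for every matrix is what makes the resulting rank reduction non-uniform across matrices in this sketch.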

From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Learning on LoRAs: GL-Equivariant Processing of Low-Rank Weight Spaces for Large Finetuned Models

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

InRank: Incremental Low-Rank Learning

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

LoTR: Low Tensor Rank Weight Adaptation

LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models

Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning

Data-free Weight Compress and Denoise for Large Language Models

BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks

CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models

Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation

LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression

LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

Low-Rank Interconnected Adaptation across Layers

NOLA: Compressing LoRA using Linear Combination of Random Basis