From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

Ajay Jaiswal,Lu Yin,Zhenyu Zhang,Shiwei Liu,Jiawei Zhao,Yuandong Tian,Zhangyang Wang
2024-07-16
Abstract: Modern Large Language Models (LLMs) are composed of matrices with billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Being significantly large, such matrices can often be expressed in low-rank format with potential to relax resource requirements. Unlike prior works which focus on developing novel matrix decomposition algorithms, in this work we first study the emergence of low-rank structures across matrices within different layers of LLMs and establish a consequential relationship between the gradient dynamics and emerging low-rank expressiveness of matrices. Our findings reveal that different layers exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize performance drop due to compression. In view of that, we present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. WeLore capitalizes on the heavy-tail distribution of singular values to identify a suitable rank reduction ratio for matrices within LLMs. Going beyond its role as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient perspective and extensive experiments illustrate that LRCs tend to have better finetuning capabilities and can closely mimic (sometimes outperform) the training loss trajectory and performance of full-finetuning with notable memory and compute footprint reduction. For example, finetuning a 50% compressed LLaMa-2 7B model using only a fraction of parameters in LRCs (WeLore) can outperform its full finetuning with ~3x better throughput and ~0.6x GPU requirement.
Our code is available at https://github.com/VITA-Group/welore
Machine Learning
What problem does this paper attempt to address?
This paper studies the low-rank structure of weight matrices in large language models (LLMs) and how that structure can be used for both compression and fine-tuning. The authors find that weight matrices in different layers converge to different degrees of low-rank structure during pre-training, and that this is tied to gradient dynamics. They propose Weight Low-Rank Projection (WeLore), which picks a rank reduction ratio for each matrix from the heavy-tailed distribution of its singular values and divides the matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs). LRCs can be compressed substantially and still fine-tune well, while N-LRCs receive little or no compression. Beyond compression, WeLore enables memory-efficient fine-tuning by backpropagating only through LRCs, which reduces the number of trainable parameters, raises throughput, and lowers GPU memory requirements. The experiments indicate that fine-tuning only the LRCs can closely match, and sometimes outperform, full fine-tuning. For example, on a 50% compressed LLaMa-2 7B model, WeLore with only about 35% of the parameters trainable outperforms full fine-tuning, with approximately 3x better throughput and approximately 0.6x the GPU requirement.
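The idea of choosing a rank from the singular value spectrum and then sorting matrices into LRCs and N-LRCs can be illustrated with a small sketch. This is not the authors' algorithm: the function names, the relative-magnitude threshold, and the parameter-count test for labeling a matrix are all assumptions made for illustration, and the singular values are taken as given (in practice they would come from an SVD of each weight matrix).

```python
def select_rank(singular_values, threshold):
    """Count singular values at least `threshold` times the largest one.

    Hypothetical selection rule, used here only for illustration.
    """
    largest = max(singular_values)
    return sum(1 for s in singular_values if s / largest >= threshold)


def factorized_params(rows, cols, rank):
    """Parameters needed to store a rows x cols matrix as two rank-`rank` factors."""
    return rank * (rows + cols)


def classify(rows, cols, singular_values, threshold):
    """Label a matrix "LRC" if its factorized form is smaller than the dense one.

    Assumed criterion: the paper's own LRC/N-LRC rule may differ.
    """
    rank = select_rank(singular_values, threshold)
    dense = rows * cols
    low_rank = factorized_params(rows, cols, rank)
    label = "LRC" if low_rank < dense else "N-LRC"
    return label, rank


# Toy spectra for two 8x8 matrices: one heavy-tailed, one nearly flat.
layers = {
    "heavy_tail": [10.0, 4.0, 1.0, 0.3, 0.1, 0.05, 0.02, 0.01],
    "flat": [10.0, 9.5, 9.0, 8.5, 8.0, 7.5, 7.0, 6.5],
}

results = {
    name: classify(8, 8, spectrum, threshold=0.2)
    for name, spectrum in layers.items()
}

for name, (label, rank) in results.items():
    print(f"{name}: {label}, kept rank {rank}")
```

With the same threshold, the heavy-tailed spectrum keeps only a few singular values while the flat one keeps all of them, which is the intuition behind applying a non-uniform rank reduction across layers.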