Abstract: The advancements in Large Language Models (LLMs) have been hindered by their substantial sizes, which necessitate LLM compression methods for practical deployment. Singular Value Decomposition (SVD) offers a promising solution for LLM compression. However, state-of-the-art SVD-based LLM compression methods have two key limitations: truncating smaller singular values may lead to higher compression loss, and the compressed weights are not updated after SVD truncation. In this work, we propose SVD-LLM, a new SVD-based LLM compression method that addresses the limitations of existing methods. SVD-LLM incorporates a truncation-aware data whitening strategy to ensure a direct mapping between singular values and compression loss. Moreover, SVD-LLM adopts a layer-wise closed-form model parameter update strategy to compensate for accuracy degradation under high compression ratios. We evaluate SVD-LLM on a total of 10 datasets and eight models from three different LLM families at four different scales. Our results demonstrate the superiority of SVD-LLM over state-of-the-art methods, especially at high model compression ratios.

What problem does this paper attempt to address?

This paper addresses two key limitations of existing SVD-based compression methods for large language models (LLMs):

1. **Truncating smaller singular values may lead to higher compression loss**: Existing singular value decomposition (SVD) based methods do not establish a direct relationship between singular values and compression loss. As a result, discarding the smallest singular values does not necessarily produce the smallest loss, and the choice of which singular values to truncate is imprecise.
2. **Compressed weights are not updated after SVD truncation**: As the compression ratio increases, more singular values must be truncated. Compensating for the resulting accuracy drop requires updating the parameters that remain after compression. Existing SVD-based methods skip this step, so they cannot recover accuracy at high compression ratios.

To solve these problems, the paper proposes SVD-LLM, a new SVD-based LLM compression method built on two key techniques:

1. **Truncation-aware data whitening**: SVD-LLM whitens the input data in a truncation-aware way, which ensures a direct mapping between singular values and compression loss. This makes it possible to identify precisely which singular values can be truncated with the least loss.
2. **Layer-wise closed-form model parameter update**: To compensate for the accuracy drop at high compression ratios, SVD-LLM updates the compressed weights one layer at a time using a closed-form solution.

With these improvements, SVD-LLM outperforms existing methods across multiple datasets and LLMs of different scales, especially at high compression ratios.
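The two techniques described above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the whitening matrix `S` is taken from a Cholesky factorization of the activation Gram matrix `X @ X.T`, and it uses a generic least-squares refit of the left factor as a stand-in for the closed-form update, whose exact form is not given in this summary. All function names, shapes, and the perturbed activations `X_new` are hypothetical and chosen only for illustration.

```python
import numpy as np

def whiten_and_truncate(W, X, rank):
    """Low-rank factors (A, B) with A @ B ~= W, plus the singular values of W @ S.

    W: (d_out, d_in) weight matrix; X: (d_in, n_samples) calibration activations.
    Assumption: S is the Cholesky factor of X @ X.T, so S @ S.T == X @ X.T.
    """
    S = np.linalg.cholesky(X @ X.T)            # lower-triangular whitening matrix
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * sigma[:rank]             # scale kept columns by singular values
    B = np.linalg.solve(S.T, Vt[:rank].T).T    # equals Vt[:rank] @ inv(S)
    return A, B, sigma

def refit_left_factor(W, B, X_new):
    """Generic closed-form (least-squares) refit of A with B held fixed.

    Minimizes ||W @ X_new - A @ (B @ X_new)||_F over A.
    This is an illustrative stand-in, not the paper's exact update rule.
    """
    Z = B @ X_new                              # (rank, n_samples)
    Y = W @ X_new                              # (d_out, n_samples)
    A_T, *_ = np.linalg.lstsq(Z.T, Y.T, rcond=None)
    return A_T.T

def output_loss(W, A, B, X):
    """Frobenius norm of the layer-output error caused by compression."""
    return np.linalg.norm(W @ X - A @ (B @ X))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
X = rng.standard_normal((6, 64))

A, B, sigma = whiten_and_truncate(W, X, rank=3)
# With whitening, the output loss matches the norm of the truncated singular values.
print(output_loss(W, A, B, X), np.sqrt(np.sum(sigma[3:] ** 2)))

# Hypothetical shifted activations, e.g. inputs seen after earlier layers were compressed.
X_new = X + 0.3 * rng.standard_normal(X.shape)
A_new = refit_left_factor(W, B, X_new)
print(output_loss(W, A, B, X_new), output_loss(W, A_new, B, X_new))
```

The first pair of printed numbers should agree up to floating-point error, which is the "direct mapping" property: after whitening, `S^{-1} X` has orthonormal rows, so the output error depends only on the discarded singular values. The second pair shows that refitting one factor in closed form cannot increase the output error on the activations it was fitted to.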

SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

Compressing Large Language Models by Joint Sparsification and Quantization

A Survey on Model Compression for Large Language Models

Aggressive Post-Training Compression on Extremely Large Language Models

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Effective SVD-Based Deep Network Compression for Automatic Speech Recognition

SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs

SDQ: Sparse Decomposed Quantization for LLM Inference

Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference

Foundations of Large Language Model Compression -- Part 1: Weight Quantization

Data-free Weight Compress and Denoise for Large Language Models

SqueezeLLM: Dense-and-Sparse Quantization

Lillama: Large Language Models Compression via Low-Rank Feature Distillation

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other

SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching