Abstract: The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, which further improves zero-shot performance. $\rm CALDERA$ obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that compressing LlaMa-$2$ $7$B/$13$B/$70$B and LlaMa-$3$ $8$B models using $\rm CALDERA$ outperforms existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera
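To make the decomposition $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$ concrete, the following is a minimal NumPy sketch of one way to alternate between a low-rank fit and a quantized residual. It is an illustration under simplifying assumptions, not the authors' implementation: the quantizer is plain uniform rounding, the low-rank update is an unweighted truncated SVD that ignores the calibration data $\mathbf{X}$, and $\mathbf{X}$ is used only to evaluate the objective from the abstract. All function names and default bit widths (`quantize_uniform`, `low_rank_factors`, `decompose`, `calibration_error`, `q_bits`, `lr_bits`) are hypothetical.

```python
import numpy as np


def quantize_uniform(A, bits):
    """Round A onto a uniform grid of 2**bits levels spanning [A.min(), A.max()].

    Assumption: a stand-in for whatever low-precision format is actually used.
    """
    lo, hi = A.min(), A.max()
    if hi == lo:
        return A.copy()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((A - lo) / scale) * scale + lo


def low_rank_factors(M, rank):
    """Return factors (L, R) of the best rank-`rank` approximation of M
    in Frobenius norm, computed with a truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s[:rank])
    return U[:, :rank] * root, root[:, None] * Vt[:rank]


def decompose(W, rank, q_bits=2, lr_bits=4, iters=10):
    """Alternate between fitting quantized low-rank factors to W - Q
    and re-quantizing the residual W - L @ R into Q."""
    Q = quantize_uniform(W, q_bits)
    for _ in range(iters):
        L, R = low_rank_factors(W - Q, rank)
        L = quantize_uniform(L, lr_bits)
        R = quantize_uniform(R, lr_bits)
        Q = quantize_uniform(W - L @ R, q_bits)
    return Q, L, R


def calibration_error(W, Q, L, R, X):
    """Objective from the abstract: ||(Q + L R - W) X^T||_F^2,
    with X of shape (num_samples, W.shape[1])."""
    return np.linalg.norm((Q + L @ R - W) @ X.T, "fro") ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic, approximately low-rank weight matrix and random calibration data.
    W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
    W += 0.05 * rng.standard_normal(W.shape)
    X = rng.standard_normal((128, 48))

    Q, L, R = decompose(W, rank=8)
    print("Q + LR error:", calibration_error(W, Q, L, R, X))
    Q_only = quantize_uniform(W, 2)
    zero_L, zero_R = np.zeros((64, 1)), np.zeros((1, 48))
    print("Q-only error:", calibration_error(W, Q_only, zero_L, zero_R, X))
```

Comparing the two printed values on a matrix of interest gives a rough sense of how much the low-rank term helps at a given rank and bit budget. A calibration-aware variant would replace the unweighted SVD with a fit that accounts for $\mathbf{X}$, in the spirit of the rank-constrained regression framework mentioned in the abstract.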

Adaptive Feature-based Low-Rank Compression of Large Language Models Via Bayesian Optimization

Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models

BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

Data-free Weight Compress and Denoise for Large Language Models

A Survey on Model Compression for Large Language Models

Bayesian Low-rank Adaptation for Large Language Models

Low-Rank Prune-And-Factorize for Language Model Compression

Aggressive Post-Training Compression on Extremely Large Language Models

Ranking LLMs by compression

LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

Large Language Models to Enhance Bayesian Optimization

Training-Free Bayesianization for Low-Rank Adapters of Large Language Models

Search for Efficient Large Language Models

SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism

LCQ: Low-Rank Codebook based Quantization for Large Language Models

Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

Compressing Large Language Models using Low Rank and Low Precision Decomposition

Large Language Model Compression with Neural Architecture Search