LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression

Ayush Kaushal, Tejas Vaidhya, Irina Rish

2023-09-25

Abstract: Low Rank Decomposition of a matrix, that is, splitting a large matrix into a product of two smaller matrices, offers a means of compression that reduces the parameters of a model without sparsification, and hence delivers more speedup on modern hardware. Moreover, unlike quantization, the compressed linear layers remain fully differentiable and all the parameters remain trainable, while still being able to leverage the existing highly efficient kernels over floating point matrices. We study the potential to compress Large Language Models (LLMs) for monolingual code generation via Low Rank Decomposition (LoRD) and observe that ranks for the linear layers in these models can be reduced by up to 39.58% with less than a 1% increase in perplexity. We then use LoRD to compress StarCoder 16B to 13.2B parameters with no drop, and to 12.3B parameters with minimal drop, in HumanEval Pass@1 score, in less than 10 minutes on a single A100. The compressed models speed up inference by up to 22.35% with just a single line of change in code over Hugging Face's implementation with the PyTorch backend. LoRD models remain compatible with state-of-the-art near-lossless quantization methods such as SpQR, which allows leveraging the further compression gains of quantization. Lastly, QLoRA over a LoRD model further reduces memory requirements by as much as 21.2% over vanilla QLoRA while offering similar gains from parameter-efficient fine-tuning. Our work shows Low Rank Decomposition (LoRD) as a promising new paradigm for LLM compression.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper asks how to compress large language models (LLMs) for monolingual code generation through Low-Rank Decomposition (LoRD), so that the parameter count falls and inference gets faster while model quality is maintained without retraining. Specifically, the paper focuses on the following aspects:

1. **Model Compression**: Decomposing each large weight matrix into a product of two smaller matrices, thereby reducing the number of model parameters.
2. **Inference Acceleration**: Improving inference speed on modern hardware as a consequence of the reduced parameter count.
3. **Performance Maintenance**: Ensuring that model quality (measured by perplexity and HumanEval scores) does not significantly degrade during compression.
4. **Combination with Existing Techniques**: Exploring how low-rank decomposition combines with quantization and parameter-efficient fine-tuning to further reduce memory use while preserving performance.

The paper validates these goals through experiments on the StarCoder and CodeGen models. For example, through low-rank decomposition, the StarCoder 16B model can be compressed to 13.2B parameters without a drop in HumanEval Pass@1 score, and in some cases with a slight improvement. The low-rank models also run inference up to 22.35% faster. Overall, the paper proposes Low-Rank Decomposition (LoRD) as a new model compression paradigm and validates its effectiveness on large-scale code generation models.
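The "product of two smaller matrices" idea in point 1 can be illustrated with a minimal NumPy sketch. It uses a plain truncated SVD, which is one common way to obtain a low-rank factorization; the summary above does not say which decomposition the authors use, so this should not be read as their exact method. The helper names (`low_rank_decompose`, `param_count_full`, `param_count_low_rank`) and the toy layer sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def low_rank_decompose(W, rank):
    """Split W (d_out x d_in) into A (d_out x rank) and B (rank x d_in), with W ~= A @ B.

    Assumption: plain truncated SVD, used here only as an illustrative factorization.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # scale the kept left singular vectors by their singular values
    B = Vt[:rank, :]            # kept right singular vectors
    return A, B


def param_count_full(d_out, d_in):
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def param_count_low_rank(d_out, d_in, rank):
    """Parameters in the two factors A (d_out x rank) and B (rank x d_in)."""
    return rank * (d_out + d_in)


# Toy weight matrix that is exactly rank 8, so a rank-8 factorization loses nothing.
rng = np.random.default_rng(0)
d_out, d_in, true_rank = 64, 48, 8
W = rng.standard_normal((d_out, true_rank)) @ rng.standard_normal((true_rank, d_in))

A, B = low_rank_decompose(W, rank=true_rank)

# One dense matmul is replaced by two smaller dense matmuls.
x = rng.standard_normal(d_in)
y_full = W @ x
y_low = A @ (B @ x)
rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)

print("relative output error:", rel_err)
print("dense parameters:", param_count_full(d_out, d_in))
print("low-rank parameters:", param_count_low_rank(d_out, d_in, true_rank))
```

The factorization only saves parameters when `rank < d_out * d_in / (d_out + d_in)`. Real LLM weight matrices are not exactly low rank, so truncating the rank introduces some approximation error. Both factors stay dense floating-point matrices, which is why the compressed layer can keep using standard matrix-multiplication kernels.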

Compressing Large Language Models using Low Rank and Low Precision Decomposition

LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models

Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

NOLA: Compressing LoRA using Linear Combination of Random Basis

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

LCQ: Low-Rank Codebook based Quantization for Large Language Models

Characterizing the Accuracy -- Efficiency Trade-off of Low-rank Decomposition in Language Models

Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy

LoRA-Mini: Adaptation Matrices Decomposition and Selective Training

LoLCATs: On Low-Rank Linearizing of Large Language Models

LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

LQER: Low-Rank Quantization Error Reconstruction for LLMs

Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs

Data-free Weight Compress and Denoise for Large Language Models

Low-Rank Correction for Quantized LLMs

SDQ: Sparse Decomposed Quantization for LLM Inference