Abstract: Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for deployment. Weight quantization has been widely used for model compression and can reduce both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook for quantization, which results in substantial accuracy loss when the compression ratio is high. In this paper, we propose a novel weight quantization method for LLMs, called low-rank codebook based quantization (LCQ). LCQ adopts a low-rank codebook, the rank of which can be larger than one, for quantization. Experiments show that LCQ can achieve better accuracy than existing methods with a negligible extra storage cost.

What problem does this paper attempt to address?

This paper addresses the high storage and computational costs that large language models (LLMs) incur during deployment. Specifically, it proposes a new weight quantization method, low-rank codebook quantization (LCQ), which aims to reduce the storage and computational costs of LLMs while maintaining high model accuracy. Existing weight quantization methods usually use rank-one codebooks, which leads to significant accuracy loss at high compression ratios. LCQ improves the accuracy of the quantized model by using low-rank codebooks with a rank greater than one, at a negligible additional storage cost.

### Paper Background

- **Large Language Models (LLMs)**: Examples include GPT, GLM, LLaMA and LLaVA. These models perform well on tasks such as natural language understanding and generation and image, video and speech processing, but their large number of parameters leads to high storage and computational costs, which is a challenge in practical applications.
- **Weight Quantization**: Representing high-bit-width weights as low-bit-width values reduces the storage and computational costs of the model and makes practical deployment easier. However, simple weight quantization causes a drop in accuracy, especially at high compression ratios.

### Existing Problems

- **Limitations of Existing Methods**: Most existing weight quantization methods use rank-one codebooks. Although they reduce storage costs, their representational ability is limited, resulting in significant accuracy loss at low bit-widths.

### Proposed Method

- **Low-rank Codebook Quantization (LCQ)**:
  - **Low-rank Codebook**: LCQ uses low-rank codebooks with a rank greater than one for quantization, improving the representational ability of the codebook.
  - **Gradient Optimization Algorithm**: A gradient-based optimization algorithm is proposed to optimize the codebook parameters.
  - **Dual-Quantization Strategy**: A dual-quantization strategy is adopted to compress the codebook parameters, further reducing the storage cost.

### Experimental Results

- **Natural Language Generation Tasks**: On pre-trained models such as OPT and LLaMA, LCQ at 2-bit quantization performs significantly better than the baseline methods.
- **Zero-shot Tasks**: On datasets such as PIQA, WinoGrande and ARC_easy, LCQ also outperforms the baseline methods on zero-shot tasks, indicating that LCQ is effective beyond the training dataset and preserves the generalization ability of the model.

### Hyperparameter Sensitivity Analysis

- **Initialization Method**: Different initialization methods have little impact on the performance of LCQ, indicating that LCQ is robust to the choice of initialization.
- **Number of Training Epochs**: 10 training epochs achieve results comparable to 40 training epochs, so 10 epochs are used by default to speed up training.

### Conclusion

By using low-rank codebook quantization, LCQ addresses the accuracy loss of LLMs at high compression ratios while keeping storage costs low, providing a new option for the practical deployment of LLMs.
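The distinction between a rank-one codebook and a codebook of rank greater than one can be illustrated with a small NumPy sketch. It rests on assumptions that the summary does not spell out: the codebook is taken to be a matrix `C` with one row of `K = 2**bits` quantization points per weight group, and a rank-`r` codebook is obtained here by truncated SVD of a quantile-based reference codebook. The truncated SVD is a stand-in chosen for brevity; it is not the gradient-based optimizer or the dual-quantization step described above, and all names (`quantize_with_codebook`, `truncate_rank`, `codebook_params`) and the synthetic weights are hypothetical.

```python
import numpy as np


def quantize_with_codebook(W, C):
    """Map every weight in group g to the nearest entry of codebook row C[g].

    W: (G, D) weights, one row per group. C: (G, K) codebook.
    Returns integer indices of shape (G, D) and dequantized weights (G, D).
    """
    dist = np.abs(W[:, :, None] - C[:, None, :])  # (G, D, K) distances
    idx = dist.argmin(axis=2)                     # nearest codebook entry
    W_hat = np.take_along_axis(C, idx, axis=1)    # look up the chosen values
    return idx, W_hat


def truncate_rank(C, r):
    """Best rank-r approximation of C in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


def codebook_params(G, K, r):
    """Numbers stored for a rank-r codebook kept as factors (G x r) and (r x K)."""
    return r * (G + K)


rng = np.random.default_rng(0)
G, D, bits = 64, 128, 2          # assumed: 64 groups of 128 weights, 2-bit indices
K = 2 ** bits

# Synthetic weights whose groups differ in both scale and offset.
scales = rng.uniform(0.5, 2.0, size=(G, 1))
offsets = 0.3 * rng.normal(size=(G, 1))
W = rng.normal(size=(G, D)) * scales + offsets

# Reference codebook: K quantiles per group (generally of full rank).
levels = (np.arange(K) + 0.5) / K
C_ref = np.quantile(W, levels, axis=1).T  # shape (G, K)

C_rank1 = truncate_rank(C_ref, 1)
C_rank2 = truncate_rank(C_ref, 2)


def fit_error(C):
    """Frobenius distance between a codebook and the reference codebook."""
    return float(np.linalg.norm(C_ref - C))


def quant_mse(C):
    """Mean squared error between the weights and their dequantized values."""
    return float(np.mean((W - quantize_with_codebook(W, C)[1]) ** 2))


for name, C, r in [("rank-1", C_rank1, 1), ("rank-2", C_rank2, 2)]:
    print(name, "params:", codebook_params(G, K, r),
          "fit error:", fit_error(C), "quantization MSE:", quant_mse(C))
```

In this sketch the rank-2 codebook is guaranteed to be at least as close to the reference codebook as the rank-1 codebook, because truncated SVD gives the best approximation at each rank. The quantization error of the weights themselves is only printed, not guaranteed to improve, which is one reason a method like LCQ would optimize the codebook directly. The factorized form stores `r * (G + K)` numbers, so the overhead grows with the rank.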

LCQ: Low-Rank Codebook based Quantization for Large Language Models
