Abstract: Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for deploying them. Weight quantization has been widely used for model compression, which can reduce both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook for quantization, which results in substantial accuracy loss when the compression ratio is high. In this paper, we propose a novel weight quantization method, called low-rank codebook based quantization (LCQ), for LLMs. LCQ adopts a low-rank codebook, the rank of which can be larger than one, for quantization. Experiments show that LCQ can achieve better accuracy than existing methods with negligible extra storage cost.
What problem does this paper attempt to address?
This paper addresses the high storage and computational costs that large language models (LLMs) incur at deployment. It proposes a new weight quantization method, low-rank codebook based quantization (LCQ), which aims to reduce these costs while preserving model accuracy. Existing weight quantization methods usually use a rank-one codebook, which leads to significant accuracy loss at high compression ratios. LCQ instead uses a low-rank codebook whose rank can be greater than one, which improves the accuracy of the quantized model at negligible additional storage cost.
### Paper Background
- **Large Language Models (LLMs)**: Models such as GPT, GLM, LLaMA, and LLaVA perform well across tasks including natural language understanding and generation as well as image, video, and speech processing. Their large parameter counts, however, lead to high storage and computational costs, which is a challenge in practical applications.
- **Weight Quantization**: Representing high-bit-width weights with low-bit-width values reduces the model's storage and computational costs and so eases practical deployment. Naive weight quantization, however, reduces accuracy, especially at high compression ratios.
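To make the accuracy trade-off concrete, here is a minimal sketch of plain round-to-nearest uniform quantization for one group of weights. It is a generic illustration, not the paper's method; the function names and the sample weights are hypothetical.

```python
def quantize_group(weights, bits):
    """Map one group of float weights to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, zero_point):
    """Reconstruct approximate float weights from integer codes."""
    return [zero_point + scale * c for c in codes]


def max_abs_error(weights, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale, zero_point = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale, zero_point)
    return max(abs(w - r) for w, r in zip(weights, restored))


if __name__ == "__main__":
    group = [-0.9, -0.2, 0.1, 0.4, 0.75, 1.1]  # hypothetical weights
    for bits in (8, 4, 2):
        print(bits, "bits -> max abs error", max_abs_error(group, bits))
```

Each weight lands at most half a step from its nearest level, and the step grows as the bit width shrinks, which is why the error becomes noticeable at 2 bits.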
### Existing Problems
- **Limitations of Existing Methods**: Most existing weight quantization methods use a rank-one codebook. This reduces storage cost, but the codebook's limited representational ability results in significant accuracy loss at low bit widths.
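One way to read "rank-one codebook" is sketched below. This is an interpretation offered for illustration, not a definition quoted from the paper: stack the per-group codebooks as rows of a matrix, and assume scale-only uniform quantization, so each row is a group's scale times one shared grid. The matrix is then an outer product of a scale vector and a grid vector. The helper names and numbers are hypothetical.

```python
def rank_one_codebook(scales, grid):
    """Row g holds the codewords of group g: scales[g] * grid (an outer product)."""
    return [[s * q for q in grid] for s in scales]


def has_rank_at_most_one(matrix, tol=1e-9):
    """A matrix has rank <= 1 exactly when every 2x2 minor is zero."""
    rows, cols = len(matrix), len(matrix[0])
    for i in range(rows):
        for j in range(i + 1, rows):
            for a in range(cols):
                for b in range(a + 1, cols):
                    minor = (matrix[i][a] * matrix[j][b]
                             - matrix[i][b] * matrix[j][a])
                    if abs(minor) > tol:
                        return False
    return True


if __name__ == "__main__":
    grid = [-1.5, -0.5, 0.5, 1.5]   # shared 2-bit grid (assumed)
    scales = [0.1, 0.25, 0.4]       # one scale per group (assumed)
    uniform = rank_one_codebook(scales, grid)
    # A codebook whose rows are not multiples of one shared grid.
    free_form = [[-0.2, -0.05, 0.05, 0.2],
                 [-0.5, -0.1, 0.2, 0.6]]
    print(has_rank_at_most_one(uniform))
    print(has_rank_at_most_one(free_form))
```

Under this reading, every group is forced to use the same codeword spacing up to a scale factor, which is the limited representational ability the bullet above refers to.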
### Proposed Method
- **Low-rank Codebook Quantization (LCQ)**:
  - **Low-rank Codebook**: LCQ uses low-rank codebooks with a rank greater than one for quantization, improving the representational ability of the model.
  - **Gradient Optimization Algorithm**: A gradient-based optimization algorithm is proposed to optimize the codebook parameters.
  - **Dual-Quantization Strategy**: A dual-quantization strategy is adopted to compress the codebook parameters, further reducing the storage cost.
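The sketch below illustrates how a codebook of rank greater than one could be used, under stated assumptions rather than as the paper's implementation. It assumes the codebook matrix factors as `A` (groups × rank) times `B` (rank × 2^bits), that each weight is assigned its nearest codeword, and that storage is the index bits plus the factor entries. The factorization, the function names, and all numbers are hypothetical, and the gradient-based training of the factors is not shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x r) and b (r x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def quantize_with_codebook(groups, codebook):
    """Assign each weight in group g to its nearest codeword in row g."""
    indices, restored = [], []
    for weights, codewords in zip(groups, codebook):
        idx = [min(range(len(codewords)), key=lambda k: abs(w - codewords[k]))
               for w in weights]
        indices.append(idx)
        restored.append([codewords[k] for k in idx])
    return indices, restored


def average_bits_per_weight(num_weights, group_size, rank, index_bits, param_bits):
    """Index bits plus amortized cost of storing the factors A and B."""
    num_groups = num_weights // group_size
    num_params = num_groups * rank + rank * 2 ** index_bits
    return index_bits + param_bits * num_params / num_weights


if __name__ == "__main__":
    # Rank-2 factors (assumed values): two groups, 2-bit codebook.
    A = [[0.1, 0.02], [0.3, -0.05]]
    B = [[-1.5, -0.5, 0.5, 1.5], [1.0, -1.0, -1.0, 1.0]]
    codebook = matmul(A, B)  # rows no longer share one uniform spacing
    groups = [[-0.12, 0.0, 0.16], [-0.48, 0.19, 0.41]]
    indices, restored = quantize_with_codebook(groups, codebook)
    print(indices)
    print(restored)
    for rank, param_bits in ((1, 16), (2, 16), (2, 8)):
        print(rank, param_bits,
              average_bits_per_weight(4096 * 4096, 128, rank, 2, param_bits))
```

The second factor lets each group bend the shared grid, so codeword spacing can differ per group. In this accounting, storing the factors at a lower bit width shrinks the overhead, which is the role the summary attributes to the dual-quantization strategy.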
### Experimental Results
- **Natural Language Generation Tasks**: On pre-trained models such as OPT and LLaMA, LCQ at 2-bit quantization performs significantly better than the baseline method.
- **Zero-shot Tasks**: On datasets such as PIQA, WinoGrande, and ARC_easy, LCQ also outperforms the baseline method on zero-shot tasks, suggesting that LCQ is effective beyond the training dataset and preserves the model's generalization ability.
### Hyperparameter Sensitivity Analysis
- **Initialization Method**: Different initialization methods have little impact on the performance of LCQ, indicating that LCQ is robust to the initialization method.
- **Number of Training Epochs**: Training for 10 epochs achieves results comparable to training for 40 epochs, so 10 epochs are used by default to speed up training.
### Conclusion
By using a low-rank codebook for quantization, LCQ reduces the accuracy loss of LLMs at high compression ratios while keeping storage cost low, offering a new option for the practical deployment of LLMs.