OneBit: Towards Extremely Low-bit Large Language Models

Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che
2024-10-28
Abstract: Model quantization uses low bit-width values to represent the weight matrices of existing models to be quantized, which is a promising approach to reduce both storage and computational overheads of deploying highly anticipated LLMs. However, current quantization methods suffer severe performance degradation when the bit-width is extremely reduced, and thus focus on utilizing 4-bit or 8-bit values to quantize models. This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. For this target, we introduce a 1-bit model compressing framework named OneBit, including a novel 1-bit parameter representation method to better quantize LLMs as well as an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the quantization framework. Sufficient experimental results indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes when only using 1-bit weight matrices.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses the excessive computational and storage overhead that large language models (LLMs) incur during deployment. Specifically, it proposes a framework called OneBit, which compresses the weight matrices of LLMs to 1-bit, enabling ultra-low bit-width deployment. The method aims to reduce the model's storage and computational overhead while maintaining high performance.

### Main Contributions

1. **1-bit Model Architecture**:
   - A new 1-bit linear layer architecture is proposed, which introduces two floating-point value vectors to compensate for the precision loss during quantization.
   - This architecture demonstrates higher stability and efficiency during training and inference.
2. **Sign-Value Independent Decomposition (SVID)**:
   - A new parameter initialization method, SVID, is proposed, which decomposes high-bit matrices into low-bit matrices to better initialize the 1-bit model.
   - Experiments show that SVID-based initialization improves model performance and convergence speed.
3. **Knowledge Distillation**:
   - A quantization-aware knowledge distillation method is used to transfer the capabilities of the original model to the 1-bit model.
   - Experimental results show that this method performs well across model scales, and that the gap to the FP16 model narrows as the model grows.

### Method Overview

1. **1-bit Linear Layer Architecture**:
   - The weight matrix \( W \) is represented as a sign matrix \( W_{\pm1} \) and two value vectors \( g \) and \( h \).
   - During training, the floating-point value vectors \( g \) and \( h \) are used to compensate for precision loss.
   - During inference, the sign matrix \( W_{\pm1} \) is packed into INT1 format to reduce storage overhead.
2. **Sign-Value Independent Decomposition (SVID)**:
   - The weight matrix \( W \) is decomposed into a sign matrix \( W_{\text{sign}} \) and a value matrix \( W_{\text{value}} \).
   - The value matrix \( W_{\text{value}} \) is further approximated by the outer product of two vectors, i.e., \( W_{\text{value}} \approx ab^T \).
   - NMF or SVD is used for the matrix decomposition, and experiments show that NMF leads to faster convergence.
3. **Knowledge Distillation**:
   - Cross-entropy loss and mean squared error loss are used to guide the quantized student model.
   - Through knowledge distillation, the capabilities of the original model are transferred to the 1-bit model.

### Experimental Results

- **Perplexity**:
  - On the WikiText2 and C4 datasets, OneBit achieves substantially better perplexity than the other baseline methods, and the gap to the FP16 model narrows as the model grows.
- **Zero-shot Task Accuracy**:
  - On zero-shot tasks such as WinoGrande, HellaSwag, PIQA, BoolQ, and ARC, OneBit comes closest to the FP16 model, with minimal performance loss.
- **Model Compression Ratio**:
  - Experimental results show that the compression ratio of OneBit increases with model scale, reaching up to 93.4%.

### Conclusion

The paper compresses the weight matrices of LLMs to 1-bit, significantly reducing the model's storage and computational overhead while maintaining high performance. This has important implications for practical applications, especially deployment on resource-constrained devices such as mobile devices and PCs.
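
### Illustrative Sketch

The SVID initialization and the 1-bit linear layer from the Method Overview can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a weight matrix of shape `(out_features, in_features)`, uses a rank-1 SVD of \( |W| \) (the summary notes that NMF is the alternative and converges faster in the paper's experiments), maps zero weights to +1, and takes \( g = b \) on the input side and \( h = a \) on the output side. The function names `svid` and `onebit_linear` are hypothetical, and `np.packbits` stands in for whatever INT1 packing a real kernel would use.

```python
import numpy as np


def svid(W):
    """Approximate W by W_sign * outer(a, b), with W_sign in {-1, +1}.

    Assumption: rank-1 SVD of |W| is used here; NMF is the other option
    mentioned in the summary.
    """
    W_sign = np.where(W >= 0, 1.0, -1.0)  # zeros mapped to +1 (assumption)
    W_value = np.abs(W)
    U, S, Vt = np.linalg.svd(W_value, full_matrices=False)
    # Split the leading singular value evenly between the two vectors.
    a = U[:, 0] * np.sqrt(S[0])
    b = Vt[0, :] * np.sqrt(S[0])
    # SVD fixes the vectors only up to a joint sign flip; pick the
    # non-negative orientation, since |W| has no negative entries.
    if a.sum() < 0:
        a, b = -a, -b
    return W_sign, a, b


def onebit_linear(x, W_sign, g, h):
    """Forward pass: scale inputs by g, apply the sign matrix, scale outputs by h."""
    return ((x * g) @ W_sign.T) * h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((6, 4))   # (out_features, in_features)
    x = rng.standard_normal((3, 4))   # batch of 3 inputs

    W_sign, a, b = svid(W)
    y_1bit = onebit_linear(x, W_sign, g=b, h=a)
    y_full = x @ W.T

    rel_err = np.linalg.norm(y_1bit - y_full) / np.linalg.norm(y_full)
    print("relative output error before any distillation:", rel_err)

    # Store one bit per weight: 8 signs per byte along the input dimension.
    packed = np.packbits(W_sign > 0, axis=1)
    print("packed shape:", packed.shape, "dtype:", packed.dtype)
```

The forward pass never materializes the dense approximation: `onebit_linear(x, W_sign, b, a)` gives the same result as multiplying `x` by `(W_sign * np.outer(a, b)).T`, but only the sign matrix and the two vectors need to be stored. In the method as summarized above, this decomposition is only the starting point, and knowledge distillation is what closes the remaining gap to the original model.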