Abstract: Model quantization uses low bit-width values to represent the weight matrices of existing models, which is a promising approach to reducing both the storage and computational overheads of deploying highly anticipated LLMs. However, current quantization methods suffer severe performance degradation when the bit-width is extremely reduced, and thus focus on using 4-bit or 8-bit values to quantize models. This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. To this end, we introduce a 1-bit model compression framework named OneBit, including a novel 1-bit parameter representation method to better quantize LLMs as well as an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the quantization framework. Extensive experimental results indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes when using only 1-bit weight matrices.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the excessive computational and storage overhead that large language models (LLMs) incur during deployment. Specifically, it proposes a framework called OneBit, which compresses the weight matrices of LLMs to 1-bit, enabling ultra-low bit-width deployment. The method aims to reduce the storage and computational overhead of the model while maintaining high performance.

### Main Contributions

1. **1-bit Model Architecture**:
   - A new 1-bit linear layer architecture is proposed, which introduces two floating-point value vectors to compensate for the precision loss during quantization.
   - This architecture demonstrates higher stability and efficiency during training and inference.
2. **Sign-Value Independent Decomposition (SVID)**:
   - A new parameter initialization method, SVID, is proposed, which decomposes high-bit matrices into low-bit matrices to better initialize the 1-bit model.
   - Experiments show that SVID-based initialization can improve model performance and convergence speed.
3. **Knowledge Distillation**:
   - A quantization-aware knowledge distillation method is used to transfer the capabilities of the original model to the 1-bit model.
   - Experimental results show that this method performs well across model scales, and that larger models come closer to FP16 performance.

### Method Overview

1. **1-bit Linear Layer Architecture**:
   - The weight matrix \( W \) is represented as a sign matrix \( W_{\pm1} \) and two value vectors \( g \) and \( h \).
   - During training, the floating-point value vectors \( g \) and \( h \) are used to compensate for precision loss.
   - During inference, the sign matrix \( W_{\pm1} \) is packed into INT1 format to reduce storage overhead.
2. **Sign-Value Independent Decomposition (SVID)**:
   - The weight matrix \( W \) is decomposed into a sign matrix \( W_{\text{sign}} \) and a value matrix \( W_{\text{value}} \).
   - The value matrix \( W_{\text{value}} \) is further approximated by the outer product of two vectors, i.e., \( W_{\text{value}} \approx ab^T \).
   - NMF or SVD is used for the matrix decomposition, and experiments show that NMF leads to faster convergence.
3. **Knowledge Distillation**:
   - Cross-entropy loss and mean squared error loss are used to guide the quantized student model.
   - Through knowledge distillation, the capabilities of the original model are transferred to the 1-bit model.

### Experimental Results

- **Perplexity**:
  - On the WikiText2 and C4 datasets, OneBit achieves significantly better perplexity than the baseline methods, and the gap to FP16 models narrows for larger models.
- **Zero-shot Task Accuracy**:
  - On zero-shot tasks such as Winogrande, HellaSwag, PIQA, BoolQ, and ARC, OneBit comes closest to the FP16 model among the compared methods, with minimal performance loss.
- **Model Compression Ratio**:
  - Experimental results show that the compression ratio of OneBit increases gradually with model scale, reaching up to 93.4%.

### Conclusion

The paper successfully compresses the weight matrices of LLMs to 1-bit, significantly reducing the storage and computational overhead of the model while maintaining high performance. This has important implications for practical applications, especially deployment on resource-constrained devices such as mobile devices and PCs.
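The decomposition and the 1-bit linear layer described in the Method Overview can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the convention \( y = xW^T \) with \( W \) of shape (out, in), uses a rank-1 SVD of \( |W| \) in place of NMF, and maps \( g = b \), \( h = a \), which follows algebraically from \( W \approx W_{\text{sign}} \odot ab^T \) under that convention. The function names `svid`, `onebit_linear`, `pack_signs`, and `unpack_signs` are hypothetical.

```python
import numpy as np

def svid(W):
    """Sketch of SVID: W ~ W_sign * outer(a, b), using a rank-1 SVD of |W|."""
    # Sign matrix in {-1, +1}; mapping exact zeros to +1 is an assumption.
    W_sign = np.where(W >= 0, 1.0, -1.0)
    # Best rank-1 approximation of the value matrix |W|.
    U, s, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    # For an entrywise positive matrix the leading singular vectors share one
    # sign, so taking absolute values leaves their outer product unchanged.
    a = np.sqrt(s[0]) * np.abs(U[:, 0])   # length: out_features
    b = np.sqrt(s[0]) * np.abs(Vt[0])     # length: in_features
    return W_sign, a, b

def onebit_linear(x, W_sign, g, h):
    """Forward pass with a sign matrix and two value vectors: ((x * g) @ W_sign.T) * h."""
    return ((x * g) @ W_sign.T) * h

def pack_signs(W_sign):
    """Store each sign as one bit (1 for +1, 0 for -1), eight signs per byte."""
    return np.packbits(W_sign > 0, axis=1)

def unpack_signs(packed, in_features):
    """Recover the {-1, +1} matrix from its packed form."""
    bits = np.unpackbits(packed, axis=1, count=in_features)
    return bits.astype(np.float64) * 2.0 - 1.0

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))   # toy weight matrix (out=64, in=32)
x = rng.standard_normal((4, 32))    # toy batch of inputs

W_sign, a, b = svid(W)
W_approx = W_sign * np.outer(a, b)

# The factored forward pass matches multiplying by the approximated matrix.
y_fast = onebit_linear(x, W_sign, g=b, h=a)
y_ref = x @ W_approx.T
print("forward passes agree:", np.allclose(y_fast, y_ref))

# Relative error of the initialization and packed storage size.
print("relative error:", np.linalg.norm(W - W_approx) / np.linalg.norm(W))
print("packed bytes:", pack_signs(W_sign).nbytes, "vs FP16 bytes:", W.size * 2)
```

The factored form never materializes the floating-point matrix: the input is scaled by \( g \), multiplied by the sign matrix, and the output is scaled by \( h \), so only the packed signs and two vectors need to be stored.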
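The distillation objective mentioned above, which combines a cross-entropy term and a mean squared error term, might look roughly like the following. The summary does not spell out the details, so several choices here are assumptions: the cross-entropy is taken between teacher and student output distributions, the MSE term compares L2-normalized hidden states, and the two terms are combined with a hypothetical weight `alpha`. The function names are illustrative only.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def l2_normalize(q):
    """Scale each hidden-state vector to unit L2 norm (assumed preprocessing)."""
    return q / np.linalg.norm(q, axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, student_hidden, teacher_hidden, alpha=1.0):
    """Cross-entropy on output distributions plus alpha * MSE on hidden states."""
    p_teacher = np.exp(log_softmax(teacher_logits))
    # Cross-entropy of the student distribution against the teacher's soft targets.
    ce = -(p_teacher * log_softmax(student_logits)).sum(axis=-1).mean()
    # Squared distance between normalized teacher and student hidden states.
    diff = l2_normalize(teacher_hidden) - l2_normalize(student_hidden)
    mse = (diff ** 2).sum(axis=-1).mean()
    return ce + alpha * mse

rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((8, 100))   # toy (tokens, vocab) logits
student_logits = teacher_logits + 0.5 * rng.standard_normal((8, 100))
teacher_hidden = rng.standard_normal((8, 16))    # toy hidden states
student_hidden = teacher_hidden + 0.5 * rng.standard_normal((8, 16))

print("loss:", kd_loss(student_logits, teacher_logits, student_hidden, teacher_hidden))
```

With identical student and teacher inputs, the MSE term vanishes and the cross-entropy term reduces to the teacher's entropy, which is the minimum this loss can reach for fixed teacher outputs.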

OneBit: Towards Extremely Low-bit Large Language Models
