Abstract: This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$ weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/Cornell-RelaxML/QuIP.
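The idea of incoherence described in the abstract can be illustrated numerically. The sketch below is not the authors' implementation: it uses dense random orthogonal matrices drawn via QR (an assumption made for simplicity, not the efficient construction the abstract refers to), synthetic data, and hypothetical helper names `random_orthogonal` and `weight_incoherence`. It measures how far the largest weight sits above the average magnitude, before and after multiplying by random orthogonal matrices.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))          # flip column signs; Q stays orthogonal


def weight_incoherence(W):
    """Smallest mu such that max|W_ij| <= mu * ||W||_F / sqrt(m * n)."""
    m, n = W.shape
    return float(np.abs(W).max() * np.sqrt(m * n) / np.linalg.norm(W))


rng = np.random.default_rng(0)
m, n = 64, 128
W = rng.standard_normal((m, n))             # synthetic weights
W[3, 7] = 50.0                              # one large outlier weight
X = rng.standard_normal((256, n))           # synthetic stand-in for layer inputs
H = X.T @ X / 256                           # synthetic Hessian-like matrix

U, V = random_orthogonal(m, rng), random_orthogonal(n, rng)
W_t = U @ W @ V.T                           # transformed weights
H_t = V @ H @ V.T                           # Hessian transformed consistently

mu_before = weight_incoherence(W)
mu_after = weight_incoherence(W_t)
print("incoherence before:", mu_before)
print("incoherence after: ", mu_after)
```

Because `U` and `V` are orthogonal, any error matrix `E` gives the same value of `tr(E H E^T)` before and after the transformation, so rounding can be done in the transformed basis and undone afterwards with `U.T @ W_hat_t @ V`.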

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" addresses the loss of accuracy that post-training quantization of large language models (LLMs) suffers at very low bit widths. It proposes QuIP (Quantization with Incoherence Processing), which makes the weight and Hessian matrices incoherent before rounding so that the weights can be quantized more accurately.

### Main contributions

1. **The QuIP method**: A new quantization method built on the insight that quantization works better when the weight and Hessian matrices are incoherent.
2. **Theoretical analysis**: The first theoretical analysis of an adaptive rounding method at LLM scale, including a proof that the rounding procedure used by QuIP (LDLQ) is optimal within a class of adaptive rounding methods. The analysis also applies to the existing method OPTQ.
3. **Empirical results**: Experiments show that QuIP produces viable LLM quantization with only 2 bits per weight, which earlier methods had not achieved.

### Method overview

The QuIP method consists of two main steps:

1. **Adaptive rounding step**:
   - Round the weights by minimizing the quadratic proxy objective $\ell(\hat{W})=\text{tr}((\hat{W} - W)H(\hat{W} - W)^T)$, where $W$ is the original weight matrix, $\hat{W}$ is the quantized weight matrix, and $H$ is the Hessian matrix of the proxy objective.
   - Quantize the columns one at a time, correcting each column with linear feedback from the rounding errors of earlier columns through a matrix $U$. Choosing $U$ from an LDL decomposition of $H$ gives the LDLQ rounding method.
2. **Efficient pre-processing and post-processing**:
   - Make the weight and Hessian matrices incoherent by multiplying them by random orthogonal matrices.
   - Pre-processing applies diagonal scaling and random orthogonal transformations to the weights and the Hessian, which reduces the influence of outliers.
   - Post-processing inverts these transformations so that the quantized weights are expressed in the original basis.
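The adaptive rounding step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`proxy_loss`, `udu_decomposition`, `round_to_grid`, `round_with_feedback`) are hypothetical, the data are synthetic, the weights are assumed to be already scaled onto the integer grid $\{0, \dots, 2^b - 1\}$, and the feedback matrix is taken from a factorization $H = (U + I)\,D\,(U + I)^T$ with $U$ strictly upper triangular and $D$ diagonal.

```python
import numpy as np


def proxy_loss(W_hat, W, H):
    """Quadratic proxy objective tr((W_hat - W) H (W_hat - W)^T)."""
    E = W_hat - W
    return float(np.trace(E @ H @ E.T))


def udu_decomposition(H):
    """Factor H = (U + I) diag(d) (U + I)^T with U strictly upper triangular.

    Computed from a Cholesky factorization of H with its index order reversed.
    """
    L = np.linalg.cholesky(H[::-1, ::-1])   # reversed H = L L^T, L lower triangular
    M = L[::-1, ::-1]                       # upper triangular, H = M M^T
    diag = np.diag(M)
    U = M / diag - np.eye(H.shape[0])       # divide column j by M[j, j], drop unit diagonal
    return U, diag ** 2


def round_to_grid(x, bits):
    """Nearest rounding onto the integer grid {0, ..., 2^bits - 1}."""
    return np.clip(np.rint(x), 0, 2 ** bits - 1)


def round_with_feedback(W, U, bits):
    """Quantize columns left to right, feeding earlier errors forward through U."""
    W_hat = np.zeros_like(W)
    eta = np.zeros_like(W)                  # rounding error of each corrected column
    for k in range(W.shape[1]):
        target = W[:, k] + (W[:, :k] - W_hat[:, :k]) @ U[:k, k]
        W_hat[:, k] = round_to_grid(target, bits)
        eta[:, k] = W_hat[:, k] - target
    return W_hat, eta


rng = np.random.default_rng(0)
m, n, bits = 8, 16, 2
X = rng.standard_normal((64, n))            # synthetic stand-in for calibration inputs
H = X.T @ X / 64 + 1e-3 * np.eye(n)         # positive definite Hessian-like matrix
W = rng.uniform(0, 2 ** bits - 1, (m, n))   # weights assumed pre-scaled to the grid range

U, d = udu_decomposition(H)
W_ldlq, eta = round_with_feedback(W, U, bits)
W_near = round_to_grid(W, bits)

print("nearest rounding loss: ", proxy_loss(W_near, W, H))
print("feedback rounding loss:", proxy_loss(W_ldlq, W, H))
```

With this choice of `U`, the proxy loss of the feedback-rounded weights equals `tr(eta @ diag(d) @ eta.T)`, so the loss depends only on the per-column rounding errors and the diagonal factor. That identity is what makes this particular feedback matrix convenient to analyze.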
### Experimental results

- **Performance comparison**: QuIP outperforms the existing OPTQ method under both 2-bit and 3-bit quantization.
- **Model size and task evaluation**: QuIP performs well on LLMs of different scales (from 1B to 66B parameters), notably at 2 bits per weight, where it still produces viable results.
- **Practical applications**: QuIP performs well on language generation benchmarks (such as WikiText2, Penn Treebank, C4) and zero-shot tasks (such as LAMBADA, ARC Easy, PiQA, StoryCloze).

### Conclusion

By introducing incoherence processing, QuIP significantly improves the accuracy of large language models under low-bit quantization, which offers a new option for deploying them efficiently.

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

Evaluating Quantized Large Language Models

Post Training Quantization of Large Language Models with Microscaling Formats

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

AffineQuant: Affine Transformation Quantization for Large Language Models

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

LRQuant: Learnable and Robust Post-Training Quantization for Large Language Models

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models