QuIP: 2-Bit Quantization of Large Language Models With Guarantees

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa
2024-01-16
Abstract: This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$ weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/Cornell-RelaxML/QuIP.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" addresses the efficiency and accuracy of post-training quantization of large language models (LLMs). It proposes QuIP (Quantization with Incoherence Processing), a method that improves weight quantization by making the weight and Hessian matrices incoherent before rounding.

### Main contributions

1. **The QuIP method**: A new quantization method built on the insight that quantization benefits from incoherent weight and Hessian matrices, that is, weights that are even in magnitude and important rounding directions that are not aligned with the coordinate axes.
2. **Theoretical analysis**: The first theoretical analysis of an LLM-scale quantization algorithm, including a proof that QuIP's rounding procedure (LDLQ) is optimal within a class of adaptive rounding methods. The theory also applies to the existing method OPTQ.
3. **Empirical results**: Experiments show that QuIP produces viable results with only 2 bits per weight, which the paper reports as a first for LLM quantization methods.

### Method overview

The QuIP method consists of two main steps:

1. **Adaptive rounding step**:
   - Round the weights adaptively by minimizing the quadratic proxy objective \(\ell(\hat{W}) = \operatorname{tr}\big((\hat{W} - W) H (\hat{W} - W)^T\big)\), where \(W\) is the original weight matrix, \(\hat{W}\) is the quantized weight matrix, and \(H\) is the Hessian matrix.
   - Update the weights iteratively, using a linear feedback matrix \(U\) to carry earlier rounding errors into later rounding decisions. The resulting procedure, LDLQ, is optimal within this class of adaptive rounding methods.
2. **Efficient pre-processing and post-processing**:
   - Make the weight and Hessian matrices incoherent by multiplying them by random orthogonal matrices.
   - The pre-processing step applies diagonal scaling and random orthogonal transformations to the weights and the Hessian to reduce the influence of outliers.
   - The post-processing step inverts these transformations, mapping the quantized weights back to the original basis.

### Experimental results

- **Performance comparison**: QuIP outperforms the existing OPTQ method under both 2-bit and 3-bit quantization.
- **Model size and task evaluation**: QuIP performs well on LLMs of different scales (from 1B to 66B parameters). Notably, its 2-bit quantization can approach the performance of the full-precision model.
- **Practical applications**: QuIP performs well on language generation tasks (such as WikiText2, Penn Treebank, and C4) and on zero-shot tasks (such as LAMBADA, ARC Easy, PiQA, and StoryCloze).

### Conclusion

By introducing incoherence processing, QuIP significantly improves the performance of large language models under low-bit quantization and offers a new option for deploying them efficiently.
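
The two steps in the method overview can be illustrated with a small NumPy sketch. It is not the authors' implementation (their code is at the repository linked in the abstract), and it rests on several assumptions: the function names (`proxy_loss`, `udu_decompose`, `ldlq_round`, `quip_sketch`) are hypothetical, the orthogonal matrices are dense and drawn via QR for clarity rather than efficiency, the transformation convention \(\tilde{W} = U W V^T\), \(\tilde{H} = V H V^T\) is assumed, the diagonal scaling step is omitted, and the min-max mapping onto the integer grid is a simplification. In the code, `U_dot` plays the role of the linear feedback matrix, while `U` and `V` are the random orthogonal matrices.

```python
import numpy as np


def proxy_loss(W_hat, W, H):
    """Quadratic proxy objective tr((W_hat - W) H (W_hat - W)^T)."""
    E = W_hat - W
    return float(np.trace(E @ H @ E.T))


def udu_decompose(H):
    """Factor H = (U_dot + I) D (U_dot + I)^T, U_dot strictly upper triangular.

    Assumes H is symmetric positive definite (add a small ridge otherwise).
    """
    n = H.shape[0]
    # Cholesky of the index-reversed matrix, reversed back, is upper triangular.
    L = np.linalg.cholesky(H[::-1, ::-1])
    U_chol = L[::-1, ::-1]            # H = U_chol @ U_chol.T
    d = np.diag(U_chol)
    U_unit = U_chol / d[None, :]      # scale columns so the diagonal is 1
    return U_unit - np.eye(n), d ** 2


def ldlq_round(W, H, lo=0, hi=3):
    """Round columns left to right, feeding back earlier rounding errors."""
    U_dot, _ = udu_decompose(H)
    W_hat = np.zeros_like(W, dtype=float)
    for k in range(W.shape[1]):
        # Linear feedback: only columns already rounded contribute.
        feedback = (W[:, :k] - W_hat[:, :k]) @ U_dot[:k, k]
        W_hat[:, k] = np.clip(np.round(W[:, k] + feedback), lo, hi)
    return W_hat


def random_orthogonal(n, rng):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q


def quip_sketch(W, H, bits=2, seed=0):
    """Pre-process, round with LDLQ, post-process (illustrative only)."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    U, V = random_orthogonal(m, rng), random_orthogonal(n, rng)
    W_t, H_t = U @ W @ V.T, V @ H @ V.T      # pre-processing
    levels = 2 ** bits - 1
    s = np.abs(W_t).max()
    grid = (W_t / s + 1) / 2 * levels        # assumed min-max map to [0, levels]
    q = ldlq_round(grid, H_t, 0, levels)
    W_t_hat = (q / levels * 2 - 1) * s       # back to the weight scale
    return U.T @ W_t_hat @ V                 # post-processing: undo the rotation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((256, 16))       # stand-in calibration inputs
    H = X.T @ X / 256 + 1e-3 * np.eye(16)    # ridge keeps H positive definite
    W = rng.standard_normal((8, 16))
    print("proxy loss:", proxy_loss(quip_sketch(W, H), W, H))
```

Because \(U\) and \(V\) are orthogonal, the proxy objective has the same value in the transformed and original coordinates, which is why rounding can be done on the rotated matrices and the result mapped back afterwards.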