QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa
2024-06-04
Abstract: Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le$ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric $E_8$ lattice, which achieves the optimal 8-dimension unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the large memory footprint of large language models (LLMs) at deployment time. It introduces QuIP#, a weight-only post-training quantization (PTQ) method that targets extreme compression (≤ 4 bits per weight) using three techniques:

1. **Incoherence Processing**: Use the Randomized Hadamard Transform (RHT) for incoherence processing. Compared with the approach in QuIP, the RHT is faster and has better theoretical properties.
2. **Lattice Codebooks**: Build hardware-efficient codebooks on the highly symmetric E8 lattice, which achieves the optimal unit-ball packing in 8 dimensions. Vector quantization with these codebooks suits the roughly ball-shaped, sub-Gaussian distribution of incoherent weights.
3. **Fine-tuning**: Apply inter-layer fine-tuning so that the quantized model stays as close as possible to the original model.

Together, these techniques let QuIP# outperform existing PTQ methods at extreme compression ratios while supporting fast inference. The experiments also show that QuIP#'s 3-bit models scale better than "theoretically lossless" 4-bit models, a previously unseen result.

### Formula Summary

- **Quantization Loss Formula**:
  \[ \ell(\hat{W})=\mathbb{E}_x\left[\|(\hat{W}-W)x\|^2\right]=\text{tr}\left((\hat{W}-W)H(\hat{W}-W)^T\right) \]
  where \(W\in\mathbb{R}^{m\times n}\) is the original weight matrix, \(\hat{W}\in\mathbb{R}^{m\times n}\) is the quantized weight matrix, \(x\in\mathbb{R}^n\) is an input vector drawn uniformly at random from the calibration set, and \(H = \mathbb{E}[xx^T]\) is a proxy Hessian matrix.
- **Incoherence Definition**:
  - A Hessian matrix \(H\in\mathbb{R}^{n\times n}\) is \(\mu\)-incoherent if its eigendecomposition \(H = Q\Lambda Q^T\) satisfies \(\max_{i,j}|Q_{ij}|\leq\mu/\sqrt{n}\).
  - A weight matrix \(W\in\mathbb{R}^{m\times n}\) is \(\mu\)-incoherent if \(\max_{i,j}|W_{ij}|\leq\mu\|W\|_F/\sqrt{mn}\).
- **Improved Incoherence Parameters**: After applying the RHT, with probability at least \(1-\delta\) the transformed Hessian and weight matrix are incoherent with parameters
  \[ \mu_H=\sqrt{2\log\left(\frac{2n^2}{\delta}\right)} \]
  \[ \mu_W = 2\log\left(\frac{4mn}{\delta}\right) \]

Through these methods, QuIP# significantly improves the quality of the quantized model, especially at extremely low bit rates.
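To make the weight-incoherence definition concrete, here is a minimal pure-Python sketch of a two-sided randomized Hadamard transform applied to a toy matrix. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`fwht`, `random_signs`, `rht_two_sided`, `mu_w`) are hypothetical, both dimensions are assumed to be powers of two, and the transform is taken to be \(W \mapsto U W V^T\) with \(U = H_m S_U\) and \(V = H_n S_V\), where \(H\) is the orthonormal Hadamard matrix and \(S\) is a random diagonal sign matrix.

```python
import math
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    v = list(v)
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]


def random_signs(n, rng):
    """Random +/-1 diagonal, stored as a list (the S matrix in H S)."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def transpose(W):
    return [list(col) for col in zip(*W)]


def rht_rows(W, signs):
    """Map each row w of W to H (signs * w), i.e. compute W V^T with V = H S."""
    return [fwht([s * x for s, x in zip(signs, row)]) for row in W]


def rht_two_sided(W, s_left, s_right):
    """Assumed form of the transform: W -> U W V^T, U = H_m S_left, V = H_n S_right."""
    W1 = rht_rows(W, s_right)              # W V^T (acts along dimension n)
    W2 = rht_rows(transpose(W1), s_left)   # (U W V^T)^T (acts along dimension m)
    return transpose(W2)


def frobenius(W):
    return math.sqrt(sum(x * x for row in W for x in row))


def mu_w(W):
    """Smallest mu such that max|W_ij| <= mu * ||W||_F / sqrt(m n)."""
    m, n = len(W), len(W[0])
    return max(abs(x) for row in W for x in row) * math.sqrt(m * n) / frobenius(W)


# Toy example: a single outlier weight is the worst case for incoherence.
rng = random.Random(0)
m, n = 8, 16
W = [[0.0] * n for _ in range(m)]
W[0][0] = 1.0
W_rht = rht_two_sided(W, random_signs(m, rng), random_signs(n, rng))
print("mu_W before:", mu_w(W))
print("mu_W after: ", mu_w(W_rht))
```

In this toy case the single outlier is spread evenly over all entries, so \(\mu_W\) drops from \(\sqrt{mn}\) to about 1 while the Frobenius norm is unchanged, because the transform is orthogonal. At such small sizes the bound \(2\log(4mn/\delta)\) is loose; it is meant for the much larger matrices found in LLM layers.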