Abstract: Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low-precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le$ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric $E_8$ lattice, which achieves the optimal 8-dimension unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.

What problem does this paper attempt to address?

The paper addresses the large memory footprint that large language models (LLMs) impose at deployment time. It introduces QuIP#, a weight-only post-training quantization (PTQ) method that targets extreme compression (≤ 4 bits per weight) using three techniques:

1. **Incoherence Processing**: QuIP# replaces QuIP's incoherence processing with the randomized Hadamard transform (RHT), which is faster and has better theoretical properties.
2. **Lattice Codebooks**: QuIP# applies vector quantization with hardware-efficient codebooks built on the highly symmetric $E_8$ lattice, which achieves the optimal unit-ball packing in 8 dimensions. This suits the ball-shaped, sub-Gaussian distribution of incoherent weights.
3. **Fine-tuning**: Inter-layer fine-tuning further improves the quantized model's fidelity to the original model.

Together, these techniques let QuIP# outperform existing PTQ methods at extreme compression ratios while supporting fast inference. The experiments report that QuIP#'s 3-bit models scale better than "theoretically lossless" 4-bit models, a previously unseen result.

### Formula Summary

- **Quantization Loss Formula**:
  \[ \ell(\hat{W})=\mathbb{E}_x\left[\|(\hat{W}-W)x\|^2\right]=\text{tr}\left((\hat{W}-W)H(\hat{W}-W)^T\right) \]
  where $W\in\mathbb{R}^{m\times n}$ is the original weight matrix, $\hat{W}\in\mathbb{R}^{m\times n}$ is the quantized weight matrix, $x\in\mathbb{R}^n$ is an input vector drawn uniformly at random from the calibration set, and $H = \mathbb{E}[xx^T]$ is a proxy Hessian matrix.
- **Incoherence Definition**:
  - A Hessian matrix $H\in\mathbb{R}^{n\times n}$ is $\mu$-incoherent if its eigendecomposition $H = Q\Lambda Q^T$ satisfies $\max_{i,j}|Q_{ij}|\leq\mu/\sqrt{n}$.
  - A weight matrix $W\in\mathbb{R}^{m\times n}$ is $\mu$-incoherent if it satisfies $\max_{i,j}|W_{ij}|\leq\mu\|W\|_F/\sqrt{mn}$.
- **Improved Incoherence Parameters**: After applying the RHT, the Hessian and the weight matrix are incoherent, with probability at least $1-\delta$, with parameters
  \[ \mu_H=\sqrt{2\log\left(\frac{2n^2}{\delta}\right)} \]
  \[ \mu_W = 2\log\left(\frac{4mn}{\delta}\right) \]

With these techniques, QuIP# substantially improves the quality of quantized models, especially at extremely low bit rates.
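The formulas above can be checked numerically on toy data. The following is a minimal NumPy sketch, not the QuIP# implementation: the helper names (`hadamard`, `rht_matrix`, `proxy_loss`, `mu_weight`, `mu_hessian`) are hypothetical, the data is synthetic, the dimensions are assumed to be powers of two, a dense Hadamard matrix stands in for a fast transform, and plain rounding stands in for a real quantizer. It illustrates that the trace form of the loss matches the empirical expectation, that an orthogonal RHT leaves the proxy loss unchanged, and that the RHT lowers the incoherence parameters of a weight matrix with an outlier and of a Hessian with a dominant input coordinate.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    had = np.ones((1, 1))
    while had.shape[0] < n:
        had = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), had)
    return had

def rht_matrix(n, rng):
    """Orthogonal matrix Had_n @ diag(s) / sqrt(n), with s a random sign vector."""
    s = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * s / np.sqrt(n)  # broadcasting scales column j by s[j]

def proxy_loss(W_hat, W, H):
    """tr((W_hat - W) H (W_hat - W)^T)."""
    D = W_hat - W
    return np.trace(D @ H @ D.T)

def mu_weight(W):
    """Smallest mu with max|W_ij| <= mu * ||W||_F / sqrt(mn)."""
    m, n = W.shape
    return np.abs(W).max() * np.sqrt(m * n) / np.linalg.norm(W)

def mu_hessian(H):
    """Smallest mu with max|Q_ij| <= mu / sqrt(n), where H = Q Lambda Q^T."""
    _, Q = np.linalg.eigh(H)
    return np.abs(Q).max() * np.sqrt(H.shape[0])

rng = np.random.default_rng(0)
m, n, num_samples = 64, 128, 4096

# Synthetic weights with one large outlier (assumption, to make W coherent).
W = rng.standard_normal((m, n))
W[3, 7] = 100.0

# Synthetic calibration inputs with one dominant coordinate (assumption).
X = rng.standard_normal((num_samples, n))
X[:, 0] *= 10.0
H = X.T @ X / num_samples  # proxy Hessian E[x x^T] over the calibration set

# Toy stand-in quantizer: round to the nearest integer.
W_hat = np.round(W)

# The empirical E_x ||(W_hat - W) x||^2 equals the trace form.
empirical = np.mean(np.sum((X @ (W_hat - W).T) ** 2, axis=1))
trace_form = proxy_loss(W_hat, W, H)

# Incoherence processing: W -> U W V^T, H -> V H V^T with orthogonal U, V.
U, V = rht_matrix(m, rng), rht_matrix(n, rng)
W_t, W_hat_t, H_t = U @ W @ V.T, U @ W_hat @ V.T, V @ H @ V.T
trace_form_t = proxy_loss(W_hat_t, W_t, H_t)

print("empirical vs trace loss:", empirical, trace_form)
print("loss after RHT:", trace_form_t)
print("mu_W before/after:", mu_weight(W), mu_weight(W_t))
print("mu_H before/after:", mu_hessian(H), mu_hessian(H_t))
```

In this sketch `W_hat` is transformed together with `W` only to show that the loss is invariant under the orthogonal change of basis; an actual pipeline would quantize in the transformed domain, where the weights are incoherent, and undo the transform at inference time.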

QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

QTIP: Quantization with Trellises and Incoherence Processing

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization

ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

AffineQuant: Affine Transformation Quantization for Large Language Models

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

FlatQuant: Flatness Matters for LLM Quantization

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction

Pyramid Vector Quantization for LLMs

CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent

SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM

AdpQ: A Zero-shot Calibration Free Adaptive Post Training Quantization Method for LLMs

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression