SqueezeLLM: Dense-and-Sparse Quantization

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

2024-06-05

Abstract: Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is available at https://github.com/SqueezeAILab/SqueezeLLM.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses the high resource demands and cost of deploying Large Language Models (LLMs), in particular the memory-bandwidth bottleneck that dominates single-batch generative inference. To overcome this, the authors introduce SqueezeLLM, a post-training quantization framework that enables lossless compression at precisions as low as 3 bits and achieves higher quantization performance under the same memory constraint. SqueezeLLM rests on two ideas:

1. **Sensitivity-based non-uniform quantization**: Instead of a simple uniform allocation, quantization is guided by how sensitive the model output is to each weight value, searching for the optimal bit precision assignment based on second-order information.
2. **Dense-and-Sparse decomposition**: Outliers and sensitive weight values are stored in an efficient sparse format, while the remaining weights are quantized densely. Removing the outliers narrows the range of the dense weights, which makes their quantization more precise.

In experiments on the LLaMA models, SqueezeLLM's 3-bit quantization reduces the perplexity gap from the FP16 baseline by up to 2.1x compared with state-of-the-art methods at the same memory requirement, and the quantized models run up to 2.3x faster than the baseline when deployed on an A6000 GPU.
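As a rough illustration of the two ideas summarized above, the following NumPy sketch splits a weight matrix into a sparse outlier part and a dense remainder, then places the dense quantization levels with a sensitivity-weighted 1-D k-means. It is not the authors' implementation. The function names, the 0.5% outlier fraction, the magnitude-only outlier rule, and the random `sensitivity` array (a stand-in for second-order information) are all assumptions made for the example, and weighted k-means is used here simply as one plausible way to build a non-uniform grid.

```python
import numpy as np

def dense_and_sparse_split(W, outlier_frac=0.005):
    """Split W into a dense part (outliers zeroed) and a sparse outlier part.

    Assumption: outliers are simply the largest-magnitude entries.
    """
    k = max(1, int(round(outlier_frac * W.size)))
    # Magnitude of the k-th largest entry serves as the outlier threshold.
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    mask = np.abs(W) >= thresh
    sparse = np.where(mask, W, 0.0)   # kept at full precision
    dense = np.where(mask, 0.0, W)    # to be quantized
    return dense, sparse, mask

def weighted_kmeans_1d(values, weights, n_bits=3, n_iter=25):
    """Lloyd's algorithm in 1-D where each value carries a weight."""
    n_centroids = 2 ** n_bits
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, n_centroids))
    for _ in range(n_iter):
        dist = np.abs(values[:, None] - centroids[None, :])
        assign = np.argmin(dist, axis=1)
        for c in range(n_centroids):
            sel = assign == c
            if sel.any() and weights[sel].sum() > 0:
                # Weighted mean pulls the centroid toward sensitive values.
                centroids[c] = np.average(values[sel], weights=weights[sel])
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, assign

def quantize(W, sensitivity, n_bits=3, outlier_frac=0.005):
    """Return the reconstructed matrix, the centroid table, and the outlier mask."""
    dense, sparse, mask = dense_and_sparse_split(W, outlier_frac)
    values = dense[~mask]
    centroids, assign = weighted_kmeans_1d(values, sensitivity[~mask], n_bits)
    W_hat = sparse.copy()
    W_hat[~mask] = centroids[assign]  # dense entries snap to their centroid
    return W_hat, centroids, mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 64))
    W[0, :4] = [12.0, -9.0, 15.0, -11.0]       # inject a few outliers
    sensitivity = rng.random(W.shape) + 1e-6   # hypothetical sensitivity scores
    W_hat, centroids, mask = quantize(W, sensitivity)
    print("outliers kept in sparse part:", int(mask.sum()))
    print("mean squared error:", float(np.mean((W - W_hat) ** 2)))
```

In this toy setup the outliers are reproduced exactly from the sparse part, while every remaining weight maps to one of at most 2^3 = 8 centroids. A real deployment would store the sparse part in a compressed format such as CSR and the dense part as 3-bit indices into the centroid table.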

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

SDQ: Sparse Decomposed Quantization for LLM Inference

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Pushing the Envelope of Low-Bit LLM via Dynamic Error Compensation

SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

Extreme Compression of Large Language Models via Additive Quantization

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid

Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM

GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

Compressing Large Language Models by Joint Sparsification and Quantization