Abstract: Large language models (LLMs) achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique that has been extensively investigated for LLMs. However, existing PTQ methods still fall short in accuracy and efficiency, especially at bit-widths below 4. Standard PTQ methods that use group-wise quantization struggle to quantize LLMs accurately at such low bit-widths, while advanced methods that retain high-precision weights element-wise struggle to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine the optimal bit-width and quantizers for accurate LLM quantization, while aligning the bit-width partition to groups for compact memory usage and fast integer inference. Specifically, SliM-LLM relies mainly on two novel techniques: (1) Salience-Determined Bit Allocation uses the clustering characteristics of the salience distribution to allocate a bit-width to each group, increasing the accuracy of quantized LLMs while maintaining inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the quantizer parameters by considering element-wise salience within each group, balancing the preservation of salient information against the minimization of errors. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bit-widths; e.g., 2-bit LLaMA-7B achieves a 5.5-times memory saving over the original model on NVIDIA A800 GPUs and a 48% decrease in perplexity compared to the state-of-the-art gradient-free PTQ method. Moreover, SliM-LLM+, an extension of SliM-LLM with gradient-based quantizers, further reduces perplexity by 35.1%.
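
The idea of aligning mixed bit-widths to whole groups can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not state: salience is approximated as squared weight times a per-input-channel activation energy, groups are ranked by summed salience, the top and bottom `frac` of groups receive `target_bits + 1` and `target_bits - 1` bits so that the average stays at `target_bits`, and each group is quantized with a plain min-max uniform quantizer. All function and parameter names are hypothetical.

```python
import random

def allocate_group_bits(group_salience, target_bits=2, frac=0.25):
    """Assumed rule: most salient groups get +1 bit, least salient get -1 bit."""
    n = len(group_salience)
    k = int(n * frac)  # same count promoted and demoted, so the mean stays fixed
    order = sorted(range(n), key=lambda g: group_salience[g])  # ascending salience
    bits = [target_bits] * n
    for g in order[:k]:
        bits[g] = target_bits - 1
    for g in order[n - k:]:
        bits[g] = target_bits + 1
    return bits

def quantize_group(values, bits):
    """Min-max uniform quantize-dequantize of one group (a stand-in quantizer)."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [min(max(round((v - lo) / scale), 0), levels) * scale + lo for v in values]

def mixed_precision_quantize(W, act_energy, group_size, target_bits=2, frac=0.25):
    """W: rows of weights (out x in); act_energy: one non-negative value per input column."""
    n_in = len(W[0])
    assert n_in % group_size == 0, "input dimension must be divisible by group_size"
    n_groups = n_in // group_size
    # Assumed salience proxy: w^2 * activation energy, summed over each column group.
    group_salience = [
        sum(row[c] ** 2 * act_energy[c]
            for row in W
            for c in range(g * group_size, (g + 1) * group_size))
        for g in range(n_groups)
    ]
    bits = allocate_group_bits(group_salience, target_bits, frac)
    W_hat = [[] for _ in W]
    for g in range(n_groups):
        for r, row in enumerate(W):
            chunk = row[g * group_size:(g + 1) * group_size]
            W_hat[r].extend(quantize_group(chunk, bits[g]))
    return W_hat, bits

# Toy usage with random data (illustrative only, not results from the paper).
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(32)] for _ in range(4)]
act_energy = [random.random() for _ in range(32)]
W_hat, bits = mixed_precision_quantize(W, act_energy, group_size=4)
print(bits, sum(bits) / len(bits))
```

Because every weight in a group shares one bit-width, the bit-width map needs only one small integer per group, which is the property the abstract associates with compact storage and fast integer inference. The salience-weighted calibration step is not modeled here.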

DB-LLM: Accurate Dual-Binarization for Efficient LLMs

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

PB-LLM: Partially Binarized Large Language Models

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

ARB-LLM: Alternating Refined Binarizations for Large Language Models

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM

TernaryLLM: Ternarized Large Language Model

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs

BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs

A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit