Evaluating Quantized Large Language Models

Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

2024-06-06

Abstract: Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found at https://github.com/thu-nics/qllm-eval.
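To make the memory claim concrete, here is a back-of-the-envelope sketch of how weight storage scales with bit-width. The helper `weight_memory_gib` is a hypothetical name introduced for illustration, not part of the paper's code. It assumes storage is simply the parameter count times the bits per weight, and it ignores quantization metadata such as scales and zero-points, as well as activations and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB.

    Assumption: storage = params * bits / 8 bytes. Scales, zero-points,
    activations, and the KV cache are deliberately left out.
    """
    return num_params * bits_per_weight / 8 / 2**30


# The two ends of the parameter range mentioned in the abstract.
for name, num_params in [("125M", 125e6), ("180B", 180e9)]:
    for bits in (16, 8, 4):
        size = weight_memory_gib(num_params, bits)
        print(f"{name} weights at {bits}-bit: {size:.2f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts weight storage by a factor of four. Real savings are somewhat smaller once per-group scales are stored.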

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the memory consumption and computational overhead of deploying large language models (LLMs), and studies how Post-Training Quantization (PTQ) techniques reduce them. Specifically, the research objectives include:

1. **Comprehensive evaluation of quantization methods on LLMs**: The paper systematically measures how quantizing different tensor types (weights, activations, and the key-value cache) affects performance across tasks, in order to guide the selection of quantization methods.
2. **Exploration of different types of quantization techniques**: The research covers not only weight and activation quantization but also key-value cache quantization, which matters most when handling long texts or large batch sizes.
3. **Coverage of diverse task types**: The evaluation spans five task types: basic natural language processing tasks, emergent abilities (such as multi-step reasoning and self-calibration), trustworthiness-related tasks (such as ethical judgment and hallucination detection), dialogue tasks, and long-context tasks.
4. **Wide range of model families and scales**: The research evaluates 11 model families, including OPT, LLaMA2, Falcon, Bloomz, and Mistral, with model sizes ranging from 125 million to 180 billion parameters.
5. **Summary of quantization effects and recommended strategies**: Based on extensive experimental results, the paper summarizes the impact of quantization on each task type and recommends bit-widths per task type that keep performance loss small.

In summary, the paper uses systematic experimental evaluation to show how quantization techniques perform in different scenarios, providing guidance for selecting them in practical applications.
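As a rough illustration of why bit-width matters, the following is a minimal sketch of symmetric per-tensor round-to-nearest quantization applied to synthetic Gaussian "weights". It is not the paper's evaluation pipeline or any specific SOTA method it studies. The function names, the random data, and the per-tensor scaling choice are all assumptions made for this example.

```python
import random


def quantize_dequantize(values, bits):
    """Symmetric per-tensor round-to-nearest quantization, then dequantization.

    Assumption: one scale for the whole tensor, taken from its max magnitude.
    """
    qmax = 2 ** (bits - 1) - 1  # largest integer level, e.g. 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    result = []
    for v in values:
        q = round(v / scale)            # map to the nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        result.append(q * scale)        # map back to floating point
    return result


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a weight tensor (hypothetical data, not model weights).
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

for bits in (8, 4, 3, 2):
    error = mean_squared_error(weights, quantize_dequantize(weights, bits))
    print(f"W{bits}: reconstruction MSE = {error:.6f}")
```

In this toy setting the reconstruction error grows as the bit-width shrinks. That is the basic trade-off behind per-task bit-width recommendations, although error on a synthetic tensor says nothing by itself about downstream task accuracy.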

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Post Training Quantization of Large Language Models with Microscaling Formats

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

A Comprehensive Study on Quantization Techniques for Large Language Models

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

RPTQ: Reorder-based Post-training Quantization for Large Language Models

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

AffineQuant: Affine Transformation Quantization for Large Language Models