Evaluating Quantized Large Language Models

Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
2024-06-06
Abstract: Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found at https://github.com/thu-nics/qllm-eval.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the memory consumption and computational overhead that large language models (LLMs) incur during deployment, and examines how well Post-Training Quantization (PTQ) reduces them without hurting model quality. Specifically, the research objectives include:

1. **Comprehensive evaluation of quantization methods on LLMs**: The paper systematically measures how quantizing different tensor types (weights, activations, and the key-value cache) affects performance across tasks, in order to guide the selection of quantization methods.
2. **Exploration of different types of quantization techniques**: Beyond weight and activation quantization, the research examines key-value (KV) cache quantization, which is particularly relevant when handling long texts.
3. **Coverage of diverse task types**: The evaluation covers five task types: basic natural language processing tasks, emergent abilities (such as multi-step reasoning and self-calibration), trustworthiness tasks (such as ethical judgment and hallucination detection), dialogue tasks, and long-context tasks.
4. **Wide range of model families and scales**: The research evaluates 11 model families, including OPT, LLaMA2, Falcon, Bloomz, and Mistral, with model parameters ranging from 125 million to 180 billion.
5. **Summary of quantization effects and recommended strategies**: Based on extensive experimental results, the paper summarizes how quantization affects each task type and recommends bit-widths per task type that keep performance loss small.

In summary, the paper uses systematic experimental evaluation to show how quantization techniques behave in different scenarios, providing guidance for selecting a quantization technique in practical applications.
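To make the bit-width trade-off concrete, the following is a minimal sketch of symmetric round-to-nearest quantization applied to a synthetic tensor. It is an illustration only, not the paper's evaluation code or any of the SOTA methods it benchmarks: the function names (`quantize_rtn`, `dequantize`, `mean_abs_error`), the per-tensor scaling scheme, and the Gaussian stand-in for weights are all assumptions made for this example.

```python
import random

def quantize_rtn(values, bits):
    """Symmetric per-tensor round-to-nearest quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1  # largest signed level, e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each value to the nearest integer level and clamp to the signed range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
# Synthetic stand-in for a weight tensor (assumption: small zero-mean Gaussian values).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {}
for bits in (8, 4, 2):
    codes, scale = quantize_rtn(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(codes, scale))
    # Storage relative to a 16-bit baseline, ignoring the single stored scale.
    print(f"W{bits}: mean abs error = {errors[bits]:.6f}, size vs FP16 = {bits / 16:.3f}x")
```

Lower bit-widths shrink storage proportionally but increase reconstruction error, which is the tension behind per-task bit-width recommendations. Practical PTQ methods typically use finer-grained scales (per channel or per group) and calibration data, which this sketch omits.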