"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

2024-11-05

Abstract: Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the trade-off between accuracy and inference performance across quantization formats for large language models (LLMs). Through a comprehensive empirical study, it evaluates the accuracy of popular quantization formats (FP8, INT8, INT4) on academic benchmarks and real-world tasks, and examines how quantization affects the quality of generated text. It also analyzes the inference performance of quantized LLMs on different GPU architectures, with the aim of providing practical guidelines for choosing the quantization format best suited to a given deployment environment. The main contributions of the paper include:

1. **Quantization accuracy evaluation**: Systematically evaluates the accuracy loss under different quantization formats, in particular FP8 weight and activation quantization (W8A8-FP), INT8 weight and activation quantization (W8A8-INT), and INT4 weight-only quantization (W4A16-INT).
2. **Text generation quality analysis**: Uses an automated evaluation suite to compare the text generated by quantized models with that of their unquantized counterparts, assessing whether quantized models maintain high semantic consistency and structural stability.
3. **Performance analysis**: Uses the popular open-source vLLM framework to benchmark inference on different GPU architectures in synchronous and asynchronous deployment scenarios, helping to identify the most appropriate quantization scheme for a given hardware and application setting.

Together, these studies offer a practical reference for applying quantization techniques, helping to reduce the accuracy loss caused by quantization while increasing inference speed and deployment efficiency.
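
The format labels used above pair a weight bit width (W) with an activation bit width (A), so W4A16-INT stores weights as 4-bit integers while activations stay in 16-bit. As a rough illustration of what integer weight quantization does to a tensor, the sketch below applies plain round-to-nearest symmetric quantization with one scale per row, using NumPy. This is an assumption chosen for simplicity, not the tuned pipeline evaluated in the paper, and the function names `quantize_symmetric` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Round-to-nearest symmetric quantization with one scale per row (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    # 4-bit values are kept in an int8 container here; real kernels pack them.
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point using the stored scales."""
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for one linear layer (random data, not model weights).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)

errors = {}
for bits in (8, 4):
    q, scale = quantize_symmetric(w, bits)
    errors[bits] = float(np.abs(w - dequantize(q, scale)).mean())
    print(f"INT{bits}: mean absolute reconstruction error = {errors[bits]:.5f}")
```

On this toy matrix the 4-bit reconstruction error is larger than the 8-bit one, which is the basic reason the accuracy of lower-bit formats needs to be measured rather than assumed.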

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

The Uniqueness of LLaMA3-70B Series with Per-Channel Quantization

Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

How Does Quantization Affect Multilingual LLMs?

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

SqueezeLLM: Dense-and-Sparse Quantization

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

Understanding the Impact of Post-Training Quantization on Large Language Models

Evaluating Quantized Large Language Models

SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM