"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh
2024-11-05
Abstract: Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the trade-off between accuracy and inference performance across quantization formats for large language models (LLMs). Through a comprehensive empirical study, it evaluates the accuracy of popular quantization formats (FP8, INT8, INT4) on academic benchmarks and real-world tasks, and examines how quantization affects the quality of generated text. It also analyzes the inference performance of quantized LLMs on different GPU architectures, with the aim of providing practical guidelines for choosing the quantization format best suited to a given deployment environment. The main contributions of the paper include:
1. **Quantization accuracy evaluation**: A systematic evaluation of accuracy loss under different quantization formats, in particular FP8 weight and activation quantization (W8A8-FP), INT8 weight and activation quantization (W8A8-INT), and INT4 weight-only quantization (W4A16-INT).
2. **Text generation quality analysis**: An automated evaluation suite compares the text generated by quantized models with that of their unquantized counterparts, to assess how well the quantized models preserve semantic consistency and structure.
3. **Performance analysis**: Inference performance is measured with the popular open-source vLLM framework on different GPU architectures, in both synchronous and asynchronous deployment scenarios, to help identify the most appropriate quantization scheme for a given combination of hardware and application.
Together, these results give practitioners a reference for applying quantization in practice: limiting the accuracy loss it causes while improving inference speed and deployment efficiency.
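
To make the bit-width trade-off concrete, the following is a minimal sketch of symmetric round-to-nearest integer quantization applied to a small weight vector at 8 and 4 bits. It is an illustration only, not the tuned procedure used in the paper: the function names, the single shared scale, and the weight values are all assumptions made for this example, and it covers weights only (in a W4A16 scheme the activations stay in 16-bit, while W8A8 schemes quantize activations as well).

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization with one shared scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code, then clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute difference between the values and their quantized form."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -1.27, 0.33, 0.90, -0.66, 0.12]

err_int8 = max_abs_error(weights, bits=8)
err_int4 = max_abs_error(weights, bits=4)
print(f"INT8 max abs error: {err_int8:.4f}")
print(f"INT4 max abs error: {err_int4:.4f}")
```

With a shared scale, the rounding error of each weight is bounded by half the scale, so dropping from 8 to 4 bits coarsens the grid considerably. This per-weight error does not translate directly into benchmark accuracy, which is why the paper measures accuracy empirically across formats.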