FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

2023-08-17

Abstract: Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models suffer from high inference costs caused by the memory bandwidth bottleneck in the auto-regressive decoding process. To address these issues, we propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. To ensure minimal quality degradation, we introduce a simple and effective heuristic approach that utilizes only the model weights of a pre-trained model. This approach is applicable to both Mixture-of-Experts (MoE) and dense models without requiring additional fine-tuning. To demonstrate the effectiveness of our proposed method, we first analyze the challenges and issues associated with LLM quantization. Subsequently, we present our heuristic approach, which adaptively finds the granularity of quantization, effectively addressing these problems. Furthermore, we implement highly efficient GPU GEMMs that perform on-the-fly matrix multiplication and dequantization, supporting the multiplication of fp16 or bf16 activations with int8 or int4 weights. We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput on the same number of GPUs.
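To make the "on-the-fly dequantization" idea in the abstract concrete, the following NumPy sketch shows only the numerical semantics of multiplying fp16 activations by packed int4 weights. It is not the paper's GPU kernel: the function names (`pack_int4`, `unpack_int4`, `weight_only_matmul`), the two-values-per-byte packing layout, and the per-column scale shape are all assumptions made for illustration.

```python
import numpy as np

def pack_int4(q):
    """Pack signed int4 values (range [-8, 7]) two per byte along axis 0.

    Assumed layout: even rows go to the low nibble, odd rows to the high nibble.
    """
    assert q.shape[0] % 2 == 0, "need an even number of rows to pack pairs"
    # Keep the two's-complement low nibble of each value.
    u = (q.astype(np.int16) & 0xF).astype(np.uint8)
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed int4 values as an int8 array."""
    lo = (packed & 0xF).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Nibbles above 7 encode negative numbers in two's complement.
    lo = np.where(lo > 7, lo - 16, lo).astype(np.int8)
    hi = np.where(hi > 7, hi - 16, hi).astype(np.int8)
    out = np.empty((packed.shape[0] * 2,) + packed.shape[1:], dtype=np.int8)
    out[0::2] = lo
    out[1::2] = hi
    return out

def weight_only_matmul(x_fp16, packed, scales):
    """Reference semantics only: dequantize the weights, then multiply.

    x_fp16: (m, k) float16 activations
    packed: (k // 2, n) uint8 packed int4 weights
    scales: (1, n) per-column scales (an assumed granularity)
    """
    w = unpack_int4(packed).astype(np.float32) * scales
    return (x_fp16.astype(np.float32) @ w).astype(np.float16)
```

A fused GPU kernel would perform the unpacking and scaling in registers while the weight tile is being loaded, so the full-precision weight matrix is never written to memory; the sketch materializes it only for clarity. The memory saving comes from the packed array occupying half a byte per weight.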

Machine Learning, Computation and Language

What problem does this paper attempt to address?

The paper addresses the high memory requirements and inference costs that large language models (LLMs) incur in practical deployment, and proposes an efficient weight-only quantization method called FineQuant. Its main contributions are as follows:

1. **In-depth Analysis of Quantization Behavior**: Conducts an extensive analysis of quantization behavior on language models, in particular the impact of low-bit quantization (down to 3 bits) on LLM accuracy.
2. **Fine-grained Quantization Algorithm**: Proposes a fine-grained quantization algorithm that combines group-wise quantization with adaptive granularity selection to minimize the quality degradation caused by quantization.
3. **Efficient GPU Kernel Implementation**: Implements highly optimized GPU kernels and analyzes their performance across different batch sizes and context lengths to determine where the method is best used on real GPUs.
4. **Accelerated Inference for Large-scale Models**: Demonstrates the method on the large-scale open-source dense transformer model OPT-175B and on internal Mixture-of-Experts (MoE) models. The experiments show reduced resource consumption and cost, with up to 3.65 times higher inference throughput on the same number of GPUs and minimal accuracy loss.

In summary, the paper targets the memory consumption and inference inefficiency of deployed LLMs with a weight-only quantization method that needs only the pre-trained weights and no additional fine-tuning.
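The second contribution can be illustrated with a minimal NumPy sketch of symmetric group-wise quantization, in which each group of consecutive rows within a column gets its own scale. The granularity search in `choose_group_size` is a hypothetical stand-in: it simply halves the group size until a relative-error tolerance is met, and it should not be read as the heuristic the paper uses, which this summary does not specify. All function names, the error metric, and the default thresholds are assumptions.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=64):
    """Symmetric quantization with one scale per (row group, column).

    w: (k, n) weight matrix; rows are split into k // group_size groups.
    Returns integer codes of shape (groups, group_size, n) and
    scales of shape (groups, 1, n).
    """
    k, n = w.shape
    assert k % group_size == 0, "group_size must divide the row count"
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits
    groups = w.reshape(k // group_size, group_size, n)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_groupwise(q, scales):
    """Map integer codes back to floating point, shape (k, n)."""
    g, s, n = q.shape
    return (q.astype(np.float32) * scales).reshape(g * s, n)

def relative_error(w, bits, group_size):
    """Frobenius-norm reconstruction error relative to the weight norm."""
    q, scales = quantize_groupwise(w, bits, group_size)
    denom = max(np.linalg.norm(w), 1e-12)
    return np.linalg.norm(w - dequantize_groupwise(q, scales)) / denom

def choose_group_size(w, bits=4, min_group=32, tol=0.05):
    """Hypothetical granularity search (an assumption, not the paper's rule).

    Start from one scale per column and halve the group size until the
    relative error drops below tol or the minimum group size is reached.
    """
    group_size = w.shape[0]
    while True:
        err = relative_error(w, bits, group_size)
        if err <= tol or group_size % 2 or group_size // 2 < min_group:
            return group_size, err
        group_size //= 2
```

Finer groups help most when a column contains a few large-magnitude weights: with a single per-column scale, one outlier stretches the quantization grid for every other weight in that column, whereas group-wise scales confine the damage to the outlier's own group. The cost is one extra scale per group, which is why the granularity is worth choosing per matrix rather than fixing globally.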

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

SqueezeLLM: Dense-and-Sparse Quantization

SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM

OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models

Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM

LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

MobileQuant: Mobile-friendly Quantization for On-device Language Models

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models