QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

2023-11-02

Abstract: Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit, while keeping some outlier weights and activations in higher-precision. The key feature of our scheme is that it is designed with computational efficiency in mind: we provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x relative to FP16 execution. We provide detailed studies for models from the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate inference using quantization plus 2:4 sparsity. Code is available at: https://github.com/IST-DASLab/QUIK.

Machine Learning

What problem does this paper attempt to address?

This paper addresses end-to-end 4-bit inference for generative large language models (LLMs) such as LLaMA, OPT, and Falcon. Most existing work quantizes only the weights, which reduces runtime cost in the memory-bound setting of generating one token at a time, but does not help in compute-bound settings such as batched inference or prompt processing. The paper proposes QUIK, a hybrid quantization strategy that quantizes most weights and activations to 4 bits while keeping some outlier weights and activations at higher precision. QUIK is designed for computational efficiency: the authors provide GPU kernels matching the QUIK format and report end-to-end throughput improvements of up to 3.4x relative to FP16 execution, while maintaining good accuracy and reducing memory requirements. The experiments show speedups across model sizes, including up to 3.4x with minor accuracy loss on the quantization-sensitive LLaMA-2 models, even at 70 billion parameters. QUIK also reduces GPU memory requirements, so a model can run on fewer GPUs than it would need in FP16.
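The hybrid idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes plain symmetric round-to-nearest quantization, picks outlier input features from the current batch by activation magnitude, and emulates the 4-bit matrix multiply with integer arithmetic on the CPU. The names `quantize_sym_int4` and `hybrid_linear` are hypothetical and exist only for this example.

```python
import numpy as np

def quantize_sym_int4(x, axis):
    """Symmetric 4-bit quantization with one scale per slice along `axis`.

    Returns integer codes in [-8, 7] (stored as int8) and the float scales.
    """
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)  # avoid division by zero
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def hybrid_linear(x, w, num_outliers):
    """Approximate x @ w.T with most features in 4 bits and a few kept in float.

    x: (tokens, d_in) activations, w: (d_out, d_in) weights.
    Assumption: outlier features are the input columns with the largest
    activation magnitude in this batch.
    """
    d_in = x.shape[1]
    order = np.argsort(np.max(np.abs(x), axis=0))
    base_idx = order[: d_in - num_outliers]      # quantized to 4 bits
    outlier_idx = order[d_in - num_outliers:]    # kept in higher precision

    xq, sx = quantize_sym_int4(x[:, base_idx], axis=1)  # one scale per token
    wq, sw = quantize_sym_int4(w[:, base_idx], axis=1)  # one scale per output row
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T   # integer matmul
    y_base = acc.astype(np.float32) * sx * sw.T         # dequantize the result

    y_outlier = x[:, outlier_idx] @ w[:, outlier_idx].T  # float matmul on outliers
    return y_base + y_outlier

# Demo on synthetic data with four artificially large input features.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64)).astype(np.float32)
x[:, [3, 17, 40, 59]] *= 50.0
w = rng.standard_normal((32, 64)).astype(np.float32)
ref = x @ w.T

for k in (0, 4):
    y = hybrid_linear(x, w, num_outliers=k)
    rel_err = np.linalg.norm(y - ref) / np.linalg.norm(ref)
    print(f"outlier features kept in float: {k}, relative error: {rel_err:.4f}")
```

On this synthetic input, keeping the four large features in float lets the remaining features use a much finer 4-bit grid, which is the intuition behind separating outliers from the bulk of the computation.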

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

SqueezeLLM: Dense-and-Sparse Quantization

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

GPTQT: Quantize Large Language Models Twice to Push the Efficiency

QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

A Speed Odyssey for Deployable Quantization of LLMs

SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM

OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models

LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models