A Speed Odyssey for Deployable Quantization of LLMs

Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, Yuchen Xie

DOI: https://doi.org/10.48550/arXiv.2311.09550

2023-11-16

Abstract: The large language model era urges faster and less costly inference. Prior model compression works on LLMs tend to undertake a software-centric approach primarily focused on the simulated quantization performance. By neglecting the feasibility of deployment, these approaches are typically disabled in real practice. They used to drastically push down the quantization bit range for a reduced computation which might not be supported by the mainstream hardware, or involve sophisticated algorithms that introduce extra computation or memory access overhead. We argue that pursuing a hardware-centric approach in the construction of quantization algorithms is crucial. In this regard, we are driven to build our compression method on top of hardware awareness, eliminating impractical algorithm choices while maximizing the benefit of hardware acceleration. Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies. Extensive experiments manifest the superiority of our W4A8 method which brings the actual speed boosting up to 4× compared to Hugging Face FP16 inference and 2.23× vs. the state-of-the-art inference engine TensorRT-LLM in FP16, and 1.45× vs. TensorRT-LLM in INT8, yet without substantially harming the performance.

Machine Learning, Computation and Language

What problem does this paper attempt to address?

This paper addresses the high latency and computational cost of deploying large language models (LLMs) in practice. It asks how quantization can shrink a model's memory footprint and speed up inference without significantly degrading model quality. Existing quantization methods tend to optimize simulated quantization performance at the software level and overlook whether the result can actually be deployed, so they often cannot be used in practice. Some push the quantization bit width very low to reduce computation, which mainstream hardware may not support; others rely on sophisticated algorithms that add computation or memory access overhead. To overcome these problems, the paper takes a hardware-centric approach to designing quantization algorithms, discarding impractical algorithm choices so that the compression method benefits from acceleration on existing hardware. The resulting method, OdysseyLLM, combines a novel W4A8 kernel implementation called FastGEMM with a combined recipe of quantization strategies. According to the reported experiments, OdysseyLLM performs well on a variety of common language benchmarks and markedly improves inference speed without substantially harming model performance: up to 4× faster than Hugging Face FP16 inference, and up to 2.23× and 1.45× faster than the state-of-the-art inference engine TensorRT-LLM in FP16 and INT8, respectively.
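
For readers unfamiliar with the notation, W4A8 denotes 4-bit weights combined with 8-bit activations. The sketch below only illustrates that arithmetic in plain Python: weights and activations are rounded to signed integers, multiplied and accumulated as integers, and the result is rescaled to floating point. It is not the FastGEMM kernel and not the paper's quantization recipe. The symmetric per-row scaling and the names `quantize_symmetric` and `w4a8_matvec` are assumptions made for illustration.

```python
# Illustrative W4A8 arithmetic: 4-bit weights, 8-bit activations.
# Assumption: symmetric scaling (one scale per weight row, one per
# activation vector). This choice is NOT taken from the paper.

def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-(2^(bits-1) - 1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def w4a8_matvec(weight_rows, activation):
    """Compute W @ x with INT4 weights and INT8 activations, then rescale."""
    x_q, x_scale = quantize_symmetric(activation, bits=8)
    out = []
    for row in weight_rows:
        w_q, w_scale = quantize_symmetric(row, bits=4)
        acc = sum(w * x for w, x in zip(w_q, x_q))  # integer accumulation
        out.append(acc * w_scale * x_scale)         # back to floating point
    return out

# Hypothetical toy data, chosen only to exercise the functions.
weights = [[0.7, -0.3, 0.1], [-0.25, 0.4, 0.05]]
x = [1.27, -0.5, 0.25]

approx = w4a8_matvec(weights, x)
exact = [sum(w * v for w, v in zip(row, x)) for row in weights]
```

Comparing `approx` with `exact` shows the rounding error that quantization introduces. The speedups quoted above come from executing the integer part on hardware-supported low-precision units, which a Python sketch cannot demonstrate.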

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

On the Compressibility of Quantized Large Language Models

COMET: Towards Partical W4A4KV4 LLMs Serving

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization

SqueezeLLM: Dense-and-Sparse Quantization

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models