A Speed Odyssey for Deployable Quantization of LLMs

Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, Yuchen Xie
DOI: https://doi.org/10.48550/arXiv.2311.09550
2023-11-16
Abstract: The large language model era urges faster and less costly inference. Prior model compression works on LLMs tend to undertake a software-centric approach primarily focused on the simulated quantization performance. By neglecting the feasibility of deployment, these approaches are typically disabled in real practice. They used to drastically push down the quantization bit range for a reduced computation which might not be supported by the mainstream hardware, or involve sophisticated algorithms that introduce extra computation or memory access overhead. We argue that pursuing a hardware-centric approach in the construction of quantization algorithms is crucial. In this regard, we are driven to build our compression method on top of hardware awareness, eliminating impractical algorithm choices while maximizing the benefit of hardware acceleration. Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies. Extensive experiments manifest the superiority of our W4A8 method which brings the actual speed boosting up to 4× compared to Hugging Face FP16 inference and 2.23× vs. the state-of-the-art inference engine TensorRT-LLM in FP16, and 1.45× vs. TensorRT-LLM in INT8, yet without substantially harming the performance.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper addresses the high latency and high computational cost of deploying large language models (LLMs) in practice. It asks how quantization can shrink a model's memory footprint and speed up inference without significantly degrading model performance. Existing quantization methods tend to be software-centric: they focus on simulated quantization performance and overlook whether the result can actually be deployed, so they are often unusable in practice. Some push the quantization bit width down sharply to reduce computation, which mainstream hardware may not support; others rely on sophisticated algorithms that introduce extra computation or memory-access overhead. The paper instead argues for a hardware-centric approach to constructing quantization algorithms, so that the compression method runs efficiently on existing hardware and impractical algorithm choices are eliminated. The resulting method, OdysseyLLM, combines a novel W4A8 kernel implementation called FastGEMM with a combined recipe of quantization strategies. Experimental results on a variety of common language benchmarks show that OdysseyLLM speeds up inference substantially while barely harming model performance: up to 4× faster than Hugging Face FP16 inference, 2.23× faster than the state-of-the-art inference engine TensorRT-LLM in FP16, and 1.45× faster than TensorRT-LLM in INT8.
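To make the W4A8 setting concrete (4-bit weights, 8-bit activations), the sketch below simulates such a matrix multiply in NumPy. It is only an illustration under stated assumptions, not the paper's FastGEMM kernel or its quantization recipe: the symmetric per-token and per-output-channel scaling, the helper names `quantize_symmetric` and `w4a8_matmul`, and the random test data are all hypothetical choices made for this example. A real kernel would pack two 4-bit weights per byte and run on integer hardware units, whereas this code only reproduces the arithmetic.

```python
import numpy as np

def quantize_symmetric(x, bits, axis):
    """Symmetric quantization with one scale per slice along `axis` (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / qmax        # guard against all-zero slices
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def w4a8_matmul(x, w):
    """Simulated W4A8 product of x (tokens, in) and w (in, out)."""
    qx, sx = quantize_symmetric(x, bits=8, axis=1)  # one scale per token
    qw, sw = quantize_symmetric(w, bits=4, axis=0)  # one scale per output channel
    acc = qx.astype(np.int32) @ qw.astype(np.int32) # integer accumulation
    return acc.astype(np.float32) * sx * sw         # dequantize with both scales

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # toy activations
w = rng.standard_normal((64, 32)).astype(np.float32)  # toy weights

y_ref = x @ w
y_q = w4a8_matmul(x, w)
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative error of simulated W4A8 matmul: {rel_err:.3f}")
```

Simulation of this kind measures accuracy loss only. It says nothing about speed, which is the gap the paper highlights: the reported speedups depend on a kernel that the target hardware can actually execute efficiently.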