What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation

Zhuocheng Gong, Jiahao Liu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan
DOI: https://doi.org/10.1609/aaai.v38i16.29765
2024-01-01
Abstract: Quantization has emerged as a promising technique for improving the memory and computational efficiency of large language models (LLMs). Though the trade-off between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this approach "the lens of perturbation". Using this lens, we conduct experiments with various artificial perturbations to explore their impact on LLM performance. Our findings reveal several connections between the properties of perturbations and LLM performance, providing insights into the failure cases of uniform quantization and suggesting potential solutions to improve the robustness of LLM quantization. To demonstrate the significance of our findings, we implement a simple non-uniform quantization approach based on our insights. Our experiments show that this approach achieves minimal performance degradation on both 4-bit weight quantization and 8-bit quantization for weights and activations. These results validate the correctness of our approach and highlight its potential to improve the efficiency of LLMs without sacrificing performance.
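
The "lens of perturbation" can be illustrated with a minimal sketch: quantize a weight vector, then treat the difference between the quantized and original values as a perturbation added to the weights. The code below is an assumption-laden illustration, not the authors' implementation. It uses plain symmetric, per-tensor, round-to-nearest uniform quantization (not the non-uniform scheme the abstract mentions), synthetic Gaussian weights, and a hypothetical helper named `uniform_quantize`.

```python
import random


def uniform_quantize(weights, num_bits=4):
    """Symmetric per-tensor uniform quantization, returned in dequantized form.

    Hypothetical helper for illustration only; returns (values, scale).
    """
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights), 0.0
    scale = max_abs / qmax  # step size between adjacent grid points
    quantized = []
    for w in weights:
        # Round to the nearest grid level and clamp to the representable range.
        level = max(-qmax, min(qmax, round(w / scale)))
        quantized.append(level * scale)
    return quantized, scale


# Synthetic stand-in for a weight tensor (an assumption, not real LLM weights).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

quantized, scale = uniform_quantize(weights, num_bits=4)

# The perturbation view: quantized = weights + perturbation.
perturbation = [q - w for q, w in zip(quantized, weights)]
max_perturbation = max(abs(d) for d in perturbation)

print(f"step size: {scale:.6f}")
print(f"largest perturbation magnitude: {max_perturbation:.6f}")
```

Under round-to-nearest, each perturbation is bounded in magnitude by half the step size, so a single large-magnitude weight widens the step and enlarges the perturbation applied to every other weight in the tensor.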