MixQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction

Yidong Chen, Chen Zhang, Rongchao Dong, Haoyuan Zhang, Yonghua Zhang, Zhonghua Lu, Jidong Zhai
DOI: https://doi.org/10.1109/sc41406.2024.00080
2024-01-01
Abstract: Mixed-precision quantization has been shown to be a promising method for enhancing the efficiency of LLMs. This technique boosts computational efficiency by processing most values with low-precision, high-throughput compute units and maintains accuracy by processing outliers in high precision. However, because outliers are dynamic, irregular, and sparse, this approach is far from using hardware efficiently. In this work, we propose MixQ, an efficient mixed-precision quantization system. Through an in-depth analysis of outlier distribution, we introduce a locality-based outlier prediction algorithm that can predict all outliers for 95.8% of tokens. Based on this accurate prediction, we propose a quantization ahead of detection (QAD) technique that can verify the correctness of the prediction. A new data structure is proposed for efficient outlier processing. Evaluation shows that MixQ achieves 1.52× and 1.78× speedup over FP16 and Bitsandbytes on 8-bit quantization, and 1.48×, 1.93×, and 6× speedup over QUIK, FP16, and AWQ on 4-bit quantization.
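To make the general idea of mixed-precision quantization concrete, the following is a minimal NumPy sketch of the baseline decomposition the abstract describes: activation channels that contain outliers take a high-precision path, and all remaining channels are quantized to INT8. It is not MixQ's outlier prediction algorithm, QAD technique, or data structure. The function name `mixed_precision_matmul`, the magnitude threshold of 6.0, and the symmetric per-row and per-column scaling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mixed_precision_matmul(x, w, threshold=6.0):
    """Hypothetical sketch: x is (tokens, k) activations, w is (k, n) weights.

    Channels of x holding any value above `threshold` (an assumed cutoff)
    are multiplied in float32; the rest are multiplied in simulated INT8.
    """
    # Detect outlier channels at run time: any |value| above the threshold.
    outlier_cols = np.any(np.abs(x) > threshold, axis=0)
    normal_cols = ~outlier_cols

    # High-precision path for the (typically few) outlier channels.
    hi = x[:, outlier_cols].astype(np.float32) @ w[outlier_cols, :].astype(np.float32)

    # Low-precision path: symmetric INT8 with per-row scales for x
    # and per-column scales for w.
    xn = x[:, normal_cols]
    wn = w[normal_cols, :]
    sx = np.maximum(np.abs(xn).max(axis=1, keepdims=True, initial=0.0), 1e-8) / 127.0
    sw = np.maximum(np.abs(wn).max(axis=0, keepdims=True, initial=0.0), 1e-8) / 127.0
    xq = np.round(xn / sx).astype(np.int8)
    wq = np.round(wn / sw).astype(np.int8)

    # Accumulate in int32, then rescale back to floating point.
    acc = xq.astype(np.int32) @ wq.astype(np.int32)
    lo = acc.astype(np.float32) * sx * sw

    return lo + hi
```

In this sketch the outlier channels are found by scanning the activations on every call, so the detection step sits on the critical path before quantization can begin. That run-time dependency is the kind of overhead the abstract attributes to dynamic, irregular outliers.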