Scaling Laws for Mixed quantization in Large Language Models

Zeyu Cao, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Yiren Zhao
2024-10-09
Abstract: Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the computational requirements for running inference on these models. In this study, we focus on a straightforward question: When aiming for a specific accuracy or perplexity target for low-precision quantization, how many high-precision numbers or calculations must be preserved as we scale LLMs to larger sizes? We first introduce a critical metric named the quantization ratio, which compares the number of parameters quantized to low-precision arithmetic against the total parameter count. Through extensive and carefully controlled experiments across different model families, arithmetic types, and quantization granularities (e.g., layer-wise, matmul-wise), we identify two central phenomena. 1) The larger the models, the better they can preserve performance with an increased quantization ratio, as measured by perplexity in pre-training tasks or accuracy in downstream tasks. 2) The finer the granularity of mixed-precision quantization (e.g., matmul-wise), the more the model can increase the quantization ratio. We believe these observed phenomena offer valuable insights for future AI hardware design and the development of advanced Efficient AI algorithms.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper asks: when low-precision quantization of a large language model (LLM) must meet a specific accuracy or perplexity target, how many high-precision parameters or computations need to be retained as the model scales up? Specifically, the authors study mixed-precision quantization across model sizes and look for the relationship between the quantization ratio (i.e., the proportion of parameters quantized to low precision out of the total parameter count) and model performance. They hope the results will inform future AI hardware design and the development of efficient AI algorithms. The paper reports two main phenomena:
1. The larger the model, the smaller the performance loss at a higher quantization ratio. This holds for both perplexity on pre-training tasks and accuracy on downstream tasks.
2. The finer the granularity of mixed-precision quantization (for example, per matrix multiplication rather than per layer), the higher the quantization ratio the model can withstand while maintaining performance.
These findings help in understanding how quantization behaves as LLMs scale and can guide future research on mixed-precision quantization.
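As an illustration of the quantization ratio described in the abstract (parameters quantized to low precision divided by the total parameter count), here is a minimal Python sketch. The function name, the component names, and the parameter counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's own tooling may compute the ratio differently.

```python
from typing import Dict, Set


def quantization_ratio(param_counts: Dict[str, int], low_precision: Set[str]) -> float:
    """Fraction of parameters assigned to low precision.

    param_counts  : parameter count per component (a layer or a single matmul,
                    depending on the chosen granularity).
    low_precision : names of the components quantized to low precision.
    """
    unknown = low_precision - set(param_counts)
    if unknown:
        raise KeyError(f"unknown components: {sorted(unknown)}")
    total = sum(param_counts.values())
    if total == 0:
        raise ValueError("model has no parameters")
    quantized = sum(n for name, n in param_counts.items() if name in low_precision)
    return quantized / total


# Hypothetical matmul-wise breakdown of one transformer block (made-up sizes).
counts = {"attn.qkv": 300, "attn.out": 100, "mlp.up": 400, "mlp.down": 400}

# Keep the attention projections in high precision, quantize the MLP matmuls.
ratio = quantization_ratio(counts, {"mlp.up", "mlp.down"})
print(f"quantization ratio = {ratio:.3f}")
```

With layer-wise granularity the dictionary would hold one entry per layer, so each precision choice covers a larger group of parameters; matmul-wise granularity gives the search more, smaller entries to assign.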