Scaling Laws for Mixed quantization in Large Language Models

Zeyu Cao, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Yiren Zhao

2024-10-09

Abstract: Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the computational requirements for running inference on these models. In this study, we focus on a straightforward question: When aiming for a specific accuracy or perplexity target for low-precision quantization, how many high-precision numbers or calculations must be preserved as we scale LLMs to larger sizes? We first introduce a critical metric named the quantization ratio, which compares the number of parameters quantized to low-precision arithmetic against the total parameter count. Through extensive and carefully controlled experiments across different model families, arithmetic types, and quantization granularities (e.g., layer-wise, matmul-wise), we identify two central phenomena. 1) The larger the models, the better they can preserve performance with an increased quantization ratio, as measured by perplexity in pre-training tasks or accuracy in downstream tasks. 2) The finer the granularity of mixed-precision quantization (e.g., matmul-wise), the more the model can increase the quantization ratio. We believe these observed phenomena offer valuable insights for future AI hardware design and the development of advanced Efficient AI algorithms.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses the following question: when large language models (LLMs) are quantized to low precision, how many high-precision parameters or computations must be retained to reach a specific accuracy or perplexity target as model size grows? Specifically, the authors study mixed-precision quantization across model sizes and look for the relationship between the quantization ratio (i.e., the proportion of low-precision parameters among all parameters) and model performance. Through this research, they hope to provide insights for future AI hardware design and the development of efficient AI algorithms. The paper reports two main phenomena: 1. The larger the model, the smaller the performance loss at a higher quantization ratio. This is reflected in both perplexity on pre-training tasks and accuracy on downstream tasks. 2. The finer the granularity of mixed-precision quantization (for example, per matrix multiplication rather than per layer), the higher the quantization ratio the model can withstand while maintaining performance. These findings describe how tolerance to quantization scales with model size and can help guide future research on mixed-precision quantization.
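To make the quantization ratio concrete, the following is a minimal sketch of how it could be computed from a precision assignment. It is not the authors' code: the `MatMul` record, the `quantization_ratio` helper, and the parameter counts are all hypothetical and chosen only for illustration. It also contrasts a matmul-wise assignment with a layer-wise one, assuming (as an illustrative simplification) that a layer-wise scheme must keep a whole layer in high precision if any of its matmuls needs it.

```python
from dataclasses import dataclass


@dataclass
class MatMul:
    """One weight matrix in a model (hypothetical bookkeeping record)."""
    layer: int
    name: str
    num_params: int
    low_precision: bool  # True if this matmul is quantized to low precision


def quantization_ratio(matmuls):
    """Low-precision parameter count divided by total parameter count."""
    total = sum(m.num_params for m in matmuls)
    if total == 0:
        raise ValueError("model has no parameters")
    low = sum(m.num_params for m in matmuls if m.low_precision)
    return low / total


def to_layer_wise(matmuls):
    """Coarsen to layer granularity (assumption for this sketch):
    a layer stays in high precision if any matmul in it does."""
    high_layers = {m.layer for m in matmuls if not m.low_precision}
    return [
        MatMul(m.layer, m.name, m.num_params, m.layer not in high_layers)
        for m in matmuls
    ]


# Toy two-layer model; the sizes are made up for illustration.
matmul_wise = [
    MatMul(0, "attn", 4_000, True),
    MatMul(0, "mlp", 8_000, True),
    MatMul(1, "attn", 4_000, False),  # only this matmul kept in high precision
    MatMul(1, "mlp", 8_000, True),
]
layer_wise = to_layer_wise(matmul_wise)

ratio_matmul = quantization_ratio(matmul_wise)
ratio_layer = quantization_ratio(layer_wise)
print(f"matmul-wise ratio: {ratio_matmul:.3f}")
print(f"layer-wise ratio:  {ratio_layer:.3f}")
```

In this toy setting the finer, matmul-wise assignment keeps fewer parameters in high precision than the layer-wise one, so its quantization ratio is higher. The sketch only illustrates the bookkeeping; it says nothing about the resulting perplexity or accuracy, which the paper measures experimentally.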

Scaling laws for post-training quantized large language models

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Scaling Laws for Precision

Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Channel-Wise Mixed-Precision Quantization for Large Language Models

Post Training Quantization of Large Language Models with Microscaling Formats

A Comprehensive Study on Quantization Techniques for Large Language Models

What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox

Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study

Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other

Evaluating Quantized Large Language Models

Understanding the difficulty of low-precision post-training quantization of large language models

When Quantization Affects Confidence of Large Language Models?