Abstract: Model reparameterization is a widely accepted technique for improving inference speed without compromising performance. However, current Post-training Quantization (PTQ) methods often cause significant accuracy degradation when applied to reparameterized models. This degradation is primarily caused by channel-specific and sample-specific outliers, which appear only in particular channels and for particular samples and distort the selection of quantization parameters. To address this issue, we propose RepAPQ, a novel framework that preserves the accuracy of quantized reparameterized models. Unlike previous frameworks, which use Mean Squared Error (MSE) as the measurement, we use Mean Absolute Error (MAE) to mitigate the influence of outliers on quantization parameters. Our framework comprises two main components: Quantization Protecting Reparameterization and Across-block Calibration. For effective calibration, Quantization Protecting Reparameterization combines multiple branches into a single convolution with an affine layer. During training, the affine layer accelerates convergence and amplifies the output of the convolution to better accommodate samples with outliers. Additionally, Across-block Calibration uses the measurement of the stage output as supervision to address the gradient problem introduced by MAE and to strengthen the interlayer correlation with quantization parameters. Comprehensive experiments demonstrate the effectiveness of RepAPQ across various models and tasks. Our framework outperforms previous methods by approximately 1% for 8-bit PTQ and 2% for 6-bit PTQ. The code is available at <https://github.com/ilur98/DLMC-QUANT>.
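
The abstract's motivation for preferring MAE over MSE can be illustrated with a small, generic sketch. This is not the RepAPQ implementation: the helper names (`fake_quantize`, `search_scale`), the grid search over clipping fractions, and the synthetic data (a unit Gaussian with a few injected outliers) are all assumptions made for illustration. The sketch searches for a symmetric uniform quantization scale under each metric and shows that the squared error, dominated by a handful of outliers, tends to pick a much coarser scale than the absolute error.

```python
import numpy as np

def fake_quantize(x, scale, num_bits=6):
    """Symmetric uniform quantize-dequantize (hypothetical helper)."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

def mse(err):
    return float(np.mean(err ** 2))

def mae(err):
    return float(np.mean(np.abs(err)))

def search_scale(x, metric, num_bits=6, num_candidates=100):
    """Grid-search the scale that minimizes `metric` of the quantization error.

    Candidates clip the range at a fraction of max|x| (an assumed search
    strategy, not taken from the paper).
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.abs(x).max()
    best_scale, best_err = None, np.inf
    for frac in np.linspace(0.01, 1.0, num_candidates):
        scale = frac * max_abs / qmax
        err = metric(x - fake_quantize(x, scale, num_bits))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

# Synthetic activations: mostly unit Gaussian, plus a few large outliers.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)
x[:10] = 50.0  # 0.1% of the values are outliers

scale_mse = search_scale(x, mse)
scale_mae = search_scale(x, mae)

# Error on the non-outlier bulk under each chosen scale.
bulk = x[10:]
bulk_err_mse = mae(bulk - fake_quantize(bulk, scale_mse))
bulk_err_mae = mae(bulk - fake_quantize(bulk, scale_mae))

print(f"scale chosen by MSE: {scale_mse:.4f}, bulk MAE: {bulk_err_mse:.4f}")
print(f"scale chosen by MAE: {scale_mae:.4f}, bulk MAE: {bulk_err_mae:.4f}")
```

In this toy setting the MAE-selected scale is finer, so the bulk of the values is represented more precisely at the cost of clipping the outliers. The abstract notes that MAE introduces a gradient problem of its own, which the paper addresses with Across-block Calibration; the sketch above does not model that part.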

How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Towards Accurate Post-training Quantization for Reparameterized Models

AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

Post-training Quantization or Quantization-aware Training? That is the Question

Training Multi-bit Quantized and Binarized Networks with A Learnable Symmetric Quantizer

Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

Post Training Quantization of Large Language Models with Microscaling Formats

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Intriguing Properties of Quantization at Scale

Symmetry Regularization and Saturating Nonlinearity for Robust Quantization

decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

Error-aware Quantization through Noise Tempering

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Understanding the difficulty of low-precision post-training quantization of large language models

Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs

Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens