Abstract: Large transformer models have demonstrated remarkable success. Post-training quantization (PTQ), which requires only a small dataset for calibration and avoids end-to-end retraining, is a promising solution for compressing these large models. Regrettably, existing PTQ methods typically exhibit non-trivial performance loss. We find that the performance bottleneck stems from over-consideration of hardware compatibility in the quantization process, which compels these methods to employ simple quantizers at the expense of accuracy. Based on this insight, we propose RepQuant, a novel PTQ framework with a quantization-inference decoupling paradigm. RepQuant employs complex quantizers in the quantization process and simplified quantizers in the inference process, and performs mathematically equivalent transformations between the two through quantization scale reparameterization, thus ensuring both accurate quantization and efficient inference. More specifically, we focus on two components with extreme distributions: LayerNorm activations and Softmax activations. Initially, we apply channel-wise quantization to the former and log$\sqrt{2}$ quantization to the latter, each tailored to the corresponding distribution. In particular, for LayerNorm activations, we introduce a learnable per-channel dual clipping scheme, which is designed to efficiently identify outliers in the unbalanced activations with fine granularity. Then, we reparameterize the scales to hardware-friendly layer-wise quantization and log2 quantization for inference. Moreover, quantized weight reconstruction is seamlessly integrated into the above procedure to further push the performance limits. Extensive experiments are performed on different large-scale transformer variants on multiple tasks, including vision, language, and multi-modal transformers, and RepQuant demonstrates significant performance advantages.
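The abstract states the two reparameterizations without formulas, so the following is only a minimal NumPy sketch of one way such equivalences could be realized, not the authors' implementation. It assumes asymmetric uniform quantization with min/max calibration (the learnable dual clipping is not modeled), that the per-channel scale ratio and zero-point shift are folded into the LayerNorm affine parameters and the following linear layer, and that the log$\sqrt{2}$ dequantizer has the form $s \cdot 2^{-q/2}$. All function and variable names are hypothetical.

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-6):
    # Standard LayerNorm over the last (channel) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def quantize(x, scale, zero_point, bits=8):
    # Asymmetric uniform quantizer; scale/zero_point may be per-channel or scalar.
    return np.clip(np.round(x / scale) + zero_point, 0, 2**bits - 1)

def reparameterize(gamma, beta, weight, bias, s, z):
    # Assumed construction: collapse per-channel (s, z) into one layer-wise pair
    # and absorb the difference into LayerNorm and the next linear layer.
    s_layer = s.mean()
    z_layer = np.round(z.mean())
    r1 = s / s_layer              # per-channel scale ratio
    r2 = z - z_layer              # per-channel integer zero-point shift
    gamma_new = gamma / r1
    beta_new = (beta + s * r2) / r1
    weight_new = r1[:, None] * weight      # undo the scaling on the input rows
    bias_new = bias - (s * r2) @ weight    # undo the shift
    return gamma_new, beta_new, weight_new, bias_new, s_layer, z_layer

def log_sqrt2_dequant(q, s):
    # Assumed log-sqrt(2) dequantizer: s * 2^(-q/2).
    return s * 2.0 ** (-q / 2.0)

def log2_dequant_reparam(q, s):
    # Same value with a power-of-two exponent: odd codes pick up a sqrt(2)
    # factor that is moved into the scale, since floor(-q/2) == -ceil(q/2).
    odd = (q % 2).astype(np.float64)
    s_tilde = s * (odd * (np.sqrt(2.0) - 1.0) + 1.0)
    return s_tilde * 2.0 ** (-np.ceil(q / 2.0))

rng = np.random.default_rng(0)
tokens, d_in, d_out = 16, 8, 4
x = rng.normal(size=(tokens, d_in))
gamma = rng.uniform(0.5, 4.0, size=d_in)   # unbalanced channel ranges
beta = rng.normal(size=d_in)
weight = rng.normal(size=(d_in, d_out))
bias = rng.normal(size=d_out)

# Quantization stage: per-channel min/max calibration on LayerNorm output.
y = layernorm(x, gamma, beta)
lo, hi = y.min(axis=0), y.max(axis=0)
s = (hi - lo) / 255.0
z = np.round(-lo / s)
q_channel = quantize(y, s, z)
out_channel = ((q_channel - z) * s) @ weight + bias

# Inference stage: layer-wise quantizer on the reparameterized LayerNorm.
g2, b2, w2, bias2, s_layer, z_layer = reparameterize(gamma, beta, weight, bias, s, z)
q_layer = quantize(layernorm(x, g2, b2), s_layer, z_layer)
out_layer = ((q_layer - z_layer) * s_layer) @ w2 + bias2

codes = np.arange(16)
```

Under these assumptions the layer-wise quantizer produces the same integer codes as the channel-wise one (up to floating-point ties at rounding boundaries), and the following linear layer gives the same output, so the finer-grained quantizer costs nothing extra at inference time.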

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

OutlierTune: Efficient Channel-Wise Quantization for Large Language Models

BERT Busters: Outlier Dimensions that Disrupt Transformers

Is It a Free Lunch for Removing Outliers during Pretraining?

Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision

OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs

FrameQuant: Flexible Low-Bit Quantization for Transformers

Understanding and Minimising Outlier Features in Neural Network Training

Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models

Unified Normalization for Accelerating and Stabilizing Transformers

Extreme Compression for Pre-trained Transformers Made Simple and Efficient

Mitigating Outlier Activations in Low-Precision Fine-Tuning of Language Models

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization

OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models

PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Outlier-Aware Training for Low-Bit Quantization of Structural Re-Parameterized Networks

Taming Sensitive Weights: Noise Perturbation Fine-tuning for Robust LLM Quantization