Abstract: Large transformer models have demonstrated remarkable success. Post-training quantization (PTQ), which requires only a small calibration dataset and avoids end-to-end retraining, is a promising solution for compressing these large models. However, existing PTQ methods typically exhibit non-trivial performance loss. We find that the performance bottleneck stems from over-consideration of hardware compatibility in the quantization process, which compels these methods to employ simple quantizers at the expense of accuracy. Based on this insight, we propose RepQuant, a novel PTQ framework with a quantization-inference decoupling paradigm that addresses these issues. RepQuant employs complex quantizers in the quantization process and simplified quantizers in the inference process, and performs mathematically equivalent transformations between the two through quantization scale reparameterization, thus ensuring both accurate quantization and efficient inference. More specifically, we focus on two components with extreme distributions: LayerNorm activations and Softmax activations. Initially, we apply channel-wise quantization to the former and log$\sqrt{2}$ quantization to the latter, each tailored to the corresponding distribution. For LayerNorm activations in particular, we introduce a learnable per-channel dual clipping scheme, which is designed to efficiently identify outliers in the unbalanced activations with fine granularity. Then, we reparameterize the scales to hardware-friendly layer-wise quantization and log2 quantization, respectively, for inference. Moreover, quantized weight reconstruction is seamlessly integrated into the above procedure to further push the performance limits. Extensive experiments are performed on large-scale transformer variants across multiple tasks, including vision, language, and multi-modal transformers, and RepQuant demonstrates significant performance advantages.
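The two reparameterizations named in the abstract can be illustrated with a minimal floating-point sketch. This is not the authors' implementation; it only checks the underlying identities under stated assumptions. All function names are hypothetical, zero-points and the learnable clipping are omitted, and the per-channel rescaling is applied directly to the activation, whereas a real pipeline would presumably absorb it into the LayerNorm affine parameters.

```python
import math

SQRT2 = math.sqrt(2.0)

def quantize_log_sqrt2(a, bits=4):
    """Map a value in (0, 1] to an integer exponent in base sqrt(2)."""
    q = round(-2.0 * math.log2(a))  # -log_{sqrt2}(a) == -2 * log2(a)
    return max(0, min(2 ** bits - 1, q))

def dequantize_log_sqrt2(q):
    """Reference dequantizer: a_hat = sqrt(2) ** (-q)."""
    return SQRT2 ** (-q)

def dequantize_reparam_log2(q):
    """Same value as a power-of-two shift times a parity-dependent scale."""
    shift = (q + 1) // 2             # ceil(q / 2), an integer shift amount
    scale = SQRT2 if q % 2 else 1.0  # odd exponents carry an extra sqrt(2)
    return scale * 2.0 ** (-shift)

def linear(x, W, b):
    """y_j = sum_i x_i * W[i][j] + b_j for a single token x."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def fold_channel_scales(x, W, s):
    """Trade per-channel scales s_i for one layer-wise scale mean(s).

    Channel i of x is divided by r_i = s_i / mean(s), and row i of the
    next layer's weight W is multiplied by r_i, so x @ W is unchanged.
    """
    s_mean = sum(s) / len(s)
    r = [si / s_mean for si in s]
    x_t = [xi / ri for xi, ri in zip(x, r)]
    W_t = [[ri * w for w in row] for ri, row in zip(r, W)]
    return x_t, W_t, s_mean

# Hypothetical activations with one outlier channel and per-channel scales.
x = [1.0, -8.0, 0.25]
s = [0.1, 1.6, 0.05]
W = [[0.5, -1.0], [0.1, 0.2], [2.0, 3.0]]
b = [0.1, -0.2]

x_t, W_t, s_mean = fold_channel_scales(x, W, s)
y, y_t = linear(x, W, b), linear(x_t, W_t, b)
```

Here `x_t[i] / s_mean` equals `x[i] / s[i]` for every channel, so a single layer-wise scale reproduces the per-channel grid, while `y_t` matches `y` up to floating-point error. Likewise, `dequantize_reparam_log2(q)` agrees with `dequantize_log_sqrt2(q)` for every integer `q`, which is why a base-2 shift suffices at inference time.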

RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization
