Abstract: Quantization is important for compressing over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from a performance drop due to its limited numerical representation ability. Conversely, mixed-precision quantization (MPQ) is advocated to compress the model effectively by allocating heterogeneous bit-widths to layers. MPQ is typically organized into a searching-retraining two-stage process. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically, in the first stage, all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However, our observations reveal a previously unseen and severe bit-width interference phenomenon among highly coupled weights during optimization, leading to considerable performance degradation under a high compression ratio. To tackle this problem, we first design a bit-width scheduler that dynamically freezes the most turbulent bit-width of layers during training, so that the remaining bit-widths converge properly. Then, taking inspiration from information theory, we present an information distortion mitigation technique to align the behavior of the poorly performing bit-widths with that of the well-performing ones. In the second stage, an inference-only greedy search scheme is devised to evaluate the quality of configurations without introducing any additional training costs. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/1hunters/retraining-free-quantization.
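The second-stage idea, an inference-only greedy search over per-layer bit-widths, can be illustrated with a minimal sketch. The abstract does not specify the search criterion, so everything below is an assumption made for illustration: the size budget measured in weight bits, the rule of lowering one layer per step, and the names `greedy_bitwidth_search`, `model_size_bits`, and `toy_evaluate` are all hypothetical. In practice `evaluate` would run inference with the shared weights at the given configuration; here it is replaced by a toy sensitivity proxy.

```python
from typing import Callable, List, Sequence


def model_size_bits(config: Sequence[int], layer_params: Sequence[int]) -> int:
    """Total weight storage in bits for a per-layer bit-width assignment."""
    return sum(bits * n for bits, n in zip(config, layer_params))


def greedy_bitwidth_search(
    layer_params: Sequence[int],
    candidates: Sequence[int],
    budget_bits: int,
    evaluate: Callable[[Sequence[int]], float],
) -> List[int]:
    """Hypothetical greedy search: lower one layer's bit-width per step,
    keeping the move with the highest score, until the budget is met.
    `evaluate` needs inference only, so no retraining happens here."""
    levels = sorted(candidates, reverse=True)
    config = [levels[0]] * len(layer_params)  # start from the widest setting
    while model_size_bits(config, layer_params) > budget_bits:
        best_score, best_config = None, None
        for i, bits in enumerate(config):
            idx = levels.index(bits)
            if idx + 1 == len(levels):
                continue  # this layer is already at the lowest bit-width
            trial = list(config)
            trial[i] = levels[idx + 1]
            score = evaluate(trial)
            if best_score is None or score > best_score:
                best_score, best_config = score, trial
        if best_config is None:
            raise ValueError("budget is unreachable with the given candidates")
        config = best_config
    return config


# Toy stand-in for validation accuracy (an assumption, not the paper's metric):
# each layer is penalized by sensitivity / 2**bits.
sensitivities = [8.0, 1.0, 4.0]


def toy_evaluate(config: Sequence[int]) -> float:
    return -sum(s / 2 ** b for s, b in zip(sensitivities, config))


layer_params = [100, 1000, 100]
result = greedy_bitwidth_search(layer_params, [2, 4, 8], 6000, toy_evaluate)
print(result)
```

With these toy numbers the search lowers only the large, low-sensitivity middle layer, which is the kind of trade-off a mixed-precision configuration is meant to capture.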

LKBQ: Pushing the Limit of Post-Training Quantization to Extreme 1 Bit

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

PTQ-SL: Exploring the Sub-layerwise Post-training Quantization

Post-training Quantization or Quantization-aware Training? That is the Question

RAPQ: Rescuing Accuracy for Power-of-Two Low-bit Post-training Quantization

Stabilized Activation Scale Estimation for Precise Post-Training Quantization

Optimization-based Post-training Quantization with Bit-split and Stitching

COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization

PD-Quant: Post-Training Quantization Based on Prediction Difference Metric

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Attention-aware Post-training Quantization without Backpropagation

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

EasyQuant: Post-training Quantization via Scale Optimization

Retraining-free Model Quantization via One-Shot Weight-Coupling Learning

Improving the Post-Training Neural Network Quantization by Prepositive Feature Quantization

Iterative Training: Finding Binary Weight Deep Neural Networks with Layer Binarization

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Towards Low-Bit Quantization of Deep Neural Networks with Limited Data