Abstract: Quantization is important for compressing over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from a performance drop due to its limited numerical representation ability. In contrast, mixed-precision quantization (MPQ) compresses the model effectively by allocating heterogeneous bit-widths to layers. MPQ is typically organized into a two-stage searching-retraining process. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically, in the first stage, all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However, our observations reveal a previously unseen and severe bit-width interference phenomenon among highly coupled weights during optimization, leading to considerable performance degradation under a high compression ratio. To tackle this problem, we first design a bit-width scheduler that dynamically freezes the most turbulent bit-widths of layers during training so that the remaining bit-widths converge properly. Then, taking inspiration from information theory, we present an information distortion mitigation technique that aligns the behavior of poorly performing bit-widths with that of well-performing ones. In the second stage, an inference-only greedy search scheme is devised to evaluate the quality of configurations without introducing any additional training cost. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/1hunters/retraining-free-quantization.
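To make the second-stage idea concrete, the following is a minimal sketch of an inference-only greedy bit-width search over shared weights. It is not the authors' implementation: the abstract does not specify the search criterion, so the weight quantization error used as the score here is an assumed stand-in for a real validation metric, and the names `quantize`, `quant_error`, and `greedy_search` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence


def quantize(weights: Sequence[float], bits: int) -> List[float]:
    """Symmetric uniform quantization of one layer's shared weights.

    Assumes bits >= 2 so that at least one positive level exists.
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    scale = max_abs / levels
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def quant_error(layers: Dict[str, List[float]], config: Dict[str, int]) -> float:
    """Assumed proxy score: total squared quantization error (lower is better)."""
    total = 0.0
    for name, weights in layers.items():
        quantized = quantize(weights, config[name])
        total += sum((w - q) ** 2 for w, q in zip(weights, quantized))
    return total


def greedy_search(
    layers: Dict[str, List[float]],
    candidates: Sequence[int],
    target_avg_bits: float,
    score: Optional[Callable[[Dict[str, List[float]], Dict[str, int]], float]] = None,
) -> Dict[str, int]:
    """Lower one layer's bit-width per step until the average-bit budget is met.

    No weights are updated: each candidate configuration is only evaluated.
    """
    score = score or quant_error
    cands = sorted(candidates)
    config = {name: cands[-1] for name in layers}  # start at the highest precision
    n_params = sum(len(w) for w in layers.values())

    def avg_bits(cfg: Dict[str, int]) -> float:
        return sum(cfg[n] * len(layers[n]) for n in layers) / n_params

    while avg_bits(config) > target_avg_bits:
        best = None
        for name, bits in config.items():
            idx = cands.index(bits)
            if idx == 0:
                continue  # this layer is already at the lowest candidate
            trial = dict(config)
            trial[name] = cands[idx - 1]
            trial_score = score(layers, trial)
            if best is None or trial_score < best[0]:
                best = (trial_score, trial)
        if best is None:
            break  # budget cannot be reached with the given candidates
        config = best[1]
    return config
```

In practice the `score` callable would run the quantized network on held-out data; passing it as a parameter keeps the search loop independent of how a configuration is judged.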

Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Optimizing Quantized Neural Networks in a Weak Curvature Manifold

Optimization-based Post-training Quantization with Bit-split and Stitching

Loss Aware Post-training Quantization

Make RepVGG Greater Again: A Quantization-aware Approach

Retraining-free Model Quantization via One-Shot Weight-Coupling Learning

Quantization Networks

Residual Quantization for Low Bit-Width Neural Networks

Post-Training Non-Uniform Quantization for Convolutional Neural Networks

Hybrid Post-Training Quantization for Super-Resolution Neural Network Compression

decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

Error-aware Quantization through Noise Tempering

Effective Quantization Methods for Recurrent Neural Networks

Instance-Aware Dynamic Neural Network Quantization

Efficient Adaptive Activation Rounding for Post-Training Quantization

Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

Alternating Multi-bit Quantization for Recurrent Neural Networks