Abstract: Quantization is important for compressing over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from a performance drop due to its limited numerical representation ability. Conversely, mixed-precision quantization (MPQ) is advocated to compress the model effectively by allocating heterogeneous bit-widths to layers. MPQ is typically organized into a two-stage searching-retraining process. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically, in the first stage, all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However, our observations reveal a previously unseen and severe bit-width interference phenomenon among highly coupled weights during optimization, leading to considerable performance degradation under a high compression ratio. To tackle this problem, we first design a bit-width scheduler that dynamically freezes the most turbulent bit-width of layers during training, to ensure that the remaining bit-widths converge properly. Then, taking inspiration from information theory, we present an information distortion mitigation technique to align the behavior of the poorly performing bit-widths with that of the well-performing ones. In the second stage, an inference-only greedy search scheme is devised to evaluate the goodness of configurations without introducing any additional training costs. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/1hunters/retraining-free-quantization.
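
To make the idea of "a set of shared weights" concrete, here is a minimal sketch of how one float weight vector can be viewed at several bit-widths. It assumes plain uniform symmetric quantization with a max-based scale; the function names, the weight values and the candidate bit-widths are all hypothetical and are not taken from the paper or its code.

```python
# Hypothetical sketch: one shared set of float weights, viewed at several bit-widths.
# The quantizer below is a generic uniform symmetric one, chosen for illustration only.

def quantize_symmetric(weights, bits):
    """Fake-quantize a list of floats to `bits` bits and return the de-quantized values."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for signed 4-bit integers
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = round(w / scale)                      # nearest integer level
        q = max(-qmax, min(qmax, q))              # clamp to the representable range
        out.append(q * scale)                     # map back to float
    return out


def mean_squared_error(a, b):
    """Average squared difference between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up weights shared by every bit-width (illustrative values only).
shared_weights = [0.82, -0.41, 0.07, -0.93, 0.33, 0.58, -0.12, 0.0]
candidate_bits = [2, 4, 6, 8]

# Each bit-width sees the same underlying weights through a coarser or finer grid.
errors = {
    bits: mean_squared_error(shared_weights, quantize_symmetric(shared_weights, bits))
    for bits in candidate_bits
}
for bits in candidate_bits:
    print(f"{bits}-bit view: MSE = {errors[bits]:.6f}")
```

Because every bit-width reads and updates the same underlying weights, a gradient step that helps one view can hurt another, which is the kind of coupling the abstract refers to as bit-width interference.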

What problem does this paper attempt to address?

This paper aims to reduce the performance degradation that follows model compression in mixed-precision quantization (MPQ), and proposes a model quantization method that needs no retraining. Specifically, it focuses on how to compress deep neural networks through effective mixed-precision quantization while preserving model performance when deploying on resource-limited devices. Traditional methods usually require extensive retraining after searching for the optimal bit-width configuration, which is time-consuming and raises the cost of deployment. The paper proposes a new "one-shot training-searching" paradigm that optimizes all bit-width configurations within a set of shared weights, thus avoiding retraining. The key contributions of the paper are as follows:

1. **Identify and analyze the bit-width interference problem**: In a quantized model with shared weights, the tight coupling between different bit-widths leads to training instability and performance degradation. The paper discusses this problem in detail and shows its impact on optimization, training stability and convergence.
2. **Design a bit-width scheduler**: The scheduler dynamically freezes the bit-widths that cause weight interference so that the other bit-widths can converge normally, which addresses the training instability caused by introducing additional bit-widths.
3. **Propose an information distortion mitigation technique**: Inspired by information theory, this technique aligns the behavior of poorly performing bit-widths with that of well-performing ones, alleviating information distortion during dynamic training and further improving the overall performance of the model.
4. **Propose an inference-only greedy search scheme**: In the second stage, a two-way greedy search algorithm evaluates the quality of different configurations without introducing additional training costs.

The experimental results show the effectiveness of this method on three representative models (ResNet18, MobileNetV2 and EfficientNetLite-B0) and three datasets (ImageNet, Pets and CIFAR100). For example, under an average 4-bit constraint, the method reaches up to 71.0% accuracy on ResNet18 with only 31.6G BitOPs and without any retraining, a significant advantage over other methods.
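
The inference-only search in the second stage can be illustrated with a simplified sketch. The code below is not the paper's algorithm: it is a one-directional greedy descent (the summary describes a two-way search), the layer table and `proxy_loss` are hypothetical stand-ins for evaluating the shared-weight network on held-out data, and BitOPs is approximated with the commonly used product MACs × weight bits × activation bits, assuming equal weight and activation bit-widths per layer.

```python
# Hypothetical per-layer data: (name, MACs, sensitivity). Values are made up.
LAYERS = [
    ("conv1", 100, 5.0),
    ("conv2", 400, 1.0),
    ("conv3", 300, 2.0),
]
CANDIDATE_BITS = [2, 4, 6, 8]


def bitops(config):
    """Cost proxy: MACs * weight_bits * activation_bits, with equal bits per layer."""
    return sum(macs * b * b for (_, macs, _), b in zip(LAYERS, config))


def proxy_loss(config):
    """Stand-in for an inference-only evaluation of one bit-width configuration.

    A real run would forward held-out data through the shared-weight network;
    this toy function simply grows as sensitive layers lose bits.
    """
    return sum(sens / b for (_, _, sens), b in zip(LAYERS, config))


def greedy_search(budget):
    """Lower one layer's bit-width per step until the BitOPs budget is met."""
    config = [max(CANDIDATE_BITS)] * len(LAYERS)
    while bitops(config) > budget:
        best = None
        for i, b in enumerate(config):
            idx = CANDIDATE_BITS.index(b)
            if idx == 0:
                continue  # this layer is already at the lowest bit-width
            trial = list(config)
            trial[i] = CANDIDATE_BITS[idx - 1]
            # Loss increase per unit of BitOPs saved: smaller is better.
            saved = bitops(config) - bitops(trial)
            score = (proxy_loss(trial) - proxy_loss(config)) / saved
            if best is None or score < best[0]:
                best = (score, trial)
        if best is None:
            raise ValueError("budget is unreachable with the given bit-widths")
        config = best[1]
    return config


# Budget equal to running every layer at 4-bit weights and 4-bit activations.
budget = 16 * sum(macs for _, macs, _ in LAYERS)
found = greedy_search(budget)
print(found, bitops(found), round(proxy_loss(found), 3))
```

No gradient step appears anywhere in the loop: each candidate is scored by evaluation alone, which is what keeps this stage free of training cost.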

Retraining-free Model Quantization via One-Shot Weight-Coupling Learning

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment

decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting

Post-training Quantization or Quantization-aware Training? That is the Question

SearchQ: Search-based Fine-Grained Quantization for Data-Free Model Compression

PTMQ: Post-training Multi-Bit Quantization of Neural Networks

MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search

Mixed-Precision Neural Network Quantization Via Learned Layer-Wise Importance

Optimization-based Post-training Quantization with Bit-split and Stitching

Automatic low-bit hybrid quantization of neural networks through meta learning

Error-aware Quantization through Noise Tempering

OMPQ: Orthogonal Mixed Precision Quantization

Mixed-precision Deep Neural Network Quantization With Multiple Compression Rates

AutoMPQ: Automatic Mixed-Precision Neural Network Search Via Few-Shot Quantization Adapter

Post-Training Quantization for Cross-Platform Learned Image Compression

OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

CSMPQ: Class Separability Based Mixed-Precision Quantization