Abstract: In this paper, we propose a generalizable mixed-precision quantization (GMPQ) method for efficient inference. Conventional methods require the datasets used for bitwidth search and model deployment to be consistent in order to guarantee policy optimality, which leads to heavy search cost on challenging large-scale datasets in realistic applications. In contrast, our GMPQ searches for a mixed-precision quantization policy that generalizes to large-scale datasets using only a small amount of data, so that the search cost is significantly reduced without performance degradation. Specifically, we observe that correctly locating network attribution is a general ability for accurate visual analysis across different data distributions. Therefore, besides pursuing higher accuracy and lower model complexity, we preserve attribution rank consistency between the quantized models and their full-precision counterparts via capacity-aware attribution imitation for generalizable mixed-precision quantization strategy search, where the capacity of the quantized networks is taken into account so that it is fully utilized without being insufficient. Since slight noise in attribution is amplified by discrete ranking operations into significant rank errors, mimicking the attribution ranks of the full-precision models can prevent the quantized networks from correctly locating the attribution. To address this, we further present a robust generalizable mixed-precision quantization (R-GMPQ) method that smooths the attribution to alleviate rank errors through hierarchical attribution partitioning, which efficiently partitions the attribution pixels at high spatial resolution and assigns the same attribution value to all pixels within a group. Moreover, we propose dynamic capacity-aware attribution imitation, which adjusts the concentration degree of the attribution according to sample hardness, so that sufficient model capacity is achieved and fully utilized for each image.
Extensive experiments on image classification and object detection show that our GMPQ and R-GMPQ obtain competitive accuracy-complexity trade-offs with significantly reduced search cost compared to the state-of-the-art mixed-precision networks.
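
The rank-error argument in the abstract can be illustrated with a toy sketch. It is not the paper's algorithm: attribution maps are flattened to 1-D lists, the partitioning uses fixed-size contiguous groups rather than a hierarchical scheme, and the names `ranks`, `rank_distance`, and `group_smooth` are hypothetical helpers introduced only for this example. It shows that slight noise between near-equal attribution values changes their ranks, and that assigning one value per group can remove that rank error.

```python
def ranks(values):
    """Rank of each entry (0 = smallest); ties are broken by index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def rank_distance(a, b):
    """Mean absolute difference between the ranks of two attribution maps."""
    ra, rb = ranks(a), ranks(b)
    return sum(abs(x - y) for x, y in zip(ra, rb)) / len(a)


def group_smooth(values, group_size):
    """Give every entry the mean attribution of its contiguous group.

    Assumption: fixed-size groups stand in for the partitioning step.
    """
    out = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        mean = sum(chunk) / len(chunk)
        out.extend([mean] * len(chunk))
    return out


# Toy attribution values: the "quantized" map differs from the
# full-precision one only by slight noise that swaps near-equal neighbours.
fp_attr = [0.10, 0.11, 0.50, 0.52, 0.90, 0.91, 0.30, 0.31]
q_attr = [0.11, 0.10, 0.52, 0.50, 0.91, 0.90, 0.31, 0.30]

raw_error = rank_distance(fp_attr, q_attr)
smoothed_error = rank_distance(group_smooth(fp_attr, 2), group_smooth(q_attr, 2))
print(raw_error, smoothed_error)
```

In this toy case every entry of the raw maps is off by one rank, while the group-smoothed maps have identical ranks. Real attribution maps are 2-D, and the group boundaries would have to be chosen by the partitioning procedure rather than fixed in advance.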

Learning Generalizable Mixed-Precision Quantization via Attribution Imitation

Generalizable Mixed-Precision Quantization via Attribution Rank Preservation

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Optimizing Quantized Neural Networks in a Weak Curvature Manifold

Mixed-Precision Neural Network Quantization Via Learned Layer-Wise Importance

Mixed-Precision Quantization with Cross-Layer Dependencies

BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of DNNs from Scratch

SEAM: Searching Transferable Mixed-Precision Quantization Policy Through Large Margin Regularization

Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip

Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation

CSMPQ: Class Separability Based Mixed-Precision Quantization

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

MPQ-Diff: Mixed Precision Quantization for Diffusion Models

Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference

Patch-wise Mixed-Precision Quantization of Vision Transformer

MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search

Towards a tailored mixed-precision sub-8-bit quantization scheme for Gated Recurrent Units using Genetic Algorithms

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Error-aware Quantization through Noise Tempering

A Near Lossless Learned Image Coding Network Quantization Approach for Cross-Platform Inference