Abstract: In this paper, we propose a generalizable mixed-precision quantization (GMPQ) method for efficient inference. Conventional methods require the datasets used for bitwidth search and for model deployment to be consistent in order to guarantee policy optimality, which leads to heavy search cost on challenging large-scale datasets in realistic applications. In contrast, our GMPQ searches for a mixed-precision quantization policy that generalizes to large-scale datasets using only a small amount of data, so that the search cost is significantly reduced without performance degradation. Specifically, we observe that correctly locating network attribution is a general ability for accurate visual analysis across different data distributions. Therefore, besides pursuing higher accuracy and lower model complexity, we preserve attribution rank consistency between the quantized models and their full-precision counterparts via capacity-aware attribution imitation for generalizable mixed-precision quantization strategy search, where the capacity of the quantized network is taken into account so that it is fully utilized without being insufficient. Because discrete ranking operations amplify slight noise in the attribution into significant rank errors, mimicking the attribution ranks of the full-precision models can prevent the quantized networks from correctly locating the attribution. To address this, we further present a robust generalizable mixed-precision quantization (R-GMPQ) method that smooths the attribution to alleviate rank errors through hierarchical attribution partitioning, which efficiently partitions the attribution pixels at high spatial resolution and assigns the same attribution value to all pixels within a group. Moreover, we propose dynamic capacity-aware attribution imitation, which adjusts the concentration degree of the attribution according to sample hardness, so that the model capacity is both sufficient and fully utilized for each image.
Extensive experiments on image classification and object detection show that our GMPQ and R-GMPQ obtain competitive accuracy-complexity trade-offs with significantly reduced search cost compared to the state-of-the-art mixed-precision networks.
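
The rank-error argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a Spearman-style score as a stand-in for attribution rank consistency, and it uses a fixed block grid in place of the hierarchical attribution partitioning, whose details the abstract does not give. The names `rank_consistency` and `block_smooth` and the two toy attribution maps are hypothetical.

```python
def flatten(attr):
    """Flatten a 2D attribution map (list of rows) into a 1D list."""
    return [v for row in attr for v in row]


def ranks(values):
    """Rank of each element (0 = smallest); ties keep pixel-index order."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def rank_consistency(attr_a, attr_b):
    """Spearman-style rank agreement between two attribution maps.

    Assumption: ties are broken by pixel index rather than averaged,
    which keeps the sketch short but is a simplification.
    """
    ra, rb = ranks(flatten(attr_a)), ranks(flatten(attr_b))
    n = len(ra)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def block_smooth(attr, block):
    """Assign every pixel the mean attribution of its block x block group.

    Assumption: a fixed grid replaces the hierarchical partitioning.
    """
    h, w = len(attr), len(attr[0])
    out = [[0.0] * w for _ in range(h)]
    for r0 in range(0, h, block):
        for c0 in range(0, w, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, h))
                     for c in range(c0, min(c0 + block, w))]
            mean = sum(attr[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out


# Toy maps: the quantized map is the full-precision map with small
# perturbations that swap neighbouring pixels in the ranking.
attr_fp = [[0.9, 0.8, 0.1, 0.2],
           [0.7, 0.6, 0.3, 0.0]]
attr_q = [[0.8, 0.9, 0.2, 0.1],
          [0.6, 0.7, 0.0, 0.3]]

raw = rank_consistency(attr_fp, attr_q)
smoothed = rank_consistency(block_smooth(attr_fp, 2), block_smooth(attr_q, 2))
print(f"raw rank consistency:      {raw:.3f}")
print(f"smoothed rank consistency: {smoothed:.3f}")
```

In this toy case both maps highlight the same left-hand region, yet the pixel-level ranks disagree because of the small perturbations. After grouping, the group-level ordering matches, which is the effect the smoothing step is meant to achieve.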
