Abstract: Low-bit quantization has emerged as one of the most promising compression approaches for deploying deep neural networks on edge devices. Mixed-precision quantization uses a mixture of bit-widths to unlock the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization methods rely on simulation on high-performance devices to find accuracy and efficiency trade-offs in immense search spaces. This leaves a non-negligible gap between the estimated efficiency metrics and those of the actual hardware, which keeps quantized models far from the optimal accuracy and efficiency, and it makes the quantization process dependent on additional high-performance devices. In this paper, we propose an On-Chip Hardware-Aware Quantization (OHQ) framework that performs hardware-aware mixed-precision quantization directly on deployed edge devices to achieve accurate and efficient computing. Specifically, for efficiency metrics, we build an On-Chip Quantization Aware pipeline, which lets the quantization process perceive the actual hardware efficiency of each quantized operator and avoids the optimization errors caused by inaccurate simulation. For accuracy metrics, we propose a Mask-Guided Quantization Estimation technique that effectively estimates the accuracy impact of operators in the on-chip scenario, removing the quantization process's dependence on high computing power. By combining these insights from the quantized model and the hardware through linear optimization, we obtain optimized bit-width configurations that achieve outstanding accuracy and efficiency. We evaluate inference accuracy and acceleration under quantization for various architectures and compression ratios on hardware. OHQ achieves 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively, and can reduce latency by 15-30% compared to INT8 in real deployment.
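
The bit-width selection step described above can be illustrated with a minimal sketch. It is not the OHQ implementation: the abstract names linear optimization, whereas this sketch swaps in an exhaustive search over a tiny configuration space so that it needs only the Python standard library. The per-layer sensitivity scores, latency values, the candidate bit-widths, and the function name `select_bit_widths` are all hypothetical and chosen only for illustration. The objective (minimize summed accuracy impact under a latency budget) is likewise an assumed formulation.

```python
from itertools import product

# Hypothetical per-layer numbers, invented for illustration only.
# sensitivity[l][b]: estimated accuracy impact of running layer l at b bits
# latency_ms[l][b]:  latency of layer l at b bits, as it would be measured on chip
BITS = (4, 8)
sensitivity = [
    {4: 0.90, 8: 0.10},
    {4: 0.30, 8: 0.05},
    {4: 0.20, 8: 0.02},
]
latency_ms = [
    {4: 1.0, 8: 1.6},
    {4: 2.0, 8: 3.0},
    {4: 1.5, 8: 2.5},
]


def select_bit_widths(sensitivity, latency_ms, budget_ms, bits=BITS):
    """Pick one bit-width per layer that minimizes total sensitivity
    while keeping total latency within budget_ms.

    Exhaustive search stands in for a linear-programming solver here;
    it is only practical for a handful of layers.
    Returns (config, cost), or (None, inf) if no config fits the budget.
    """
    best_cfg, best_cost = None, float("inf")
    for cfg in product(bits, repeat=len(sensitivity)):
        total_latency = sum(latency_ms[l][b] for l, b in enumerate(cfg))
        if total_latency > budget_ms:
            continue  # violates the latency constraint
        cost = sum(sensitivity[l][b] for l, b in enumerate(cfg))
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


config, cost = select_bit_widths(sensitivity, latency_ms, budget_ms=5.5)
print(config, cost)
```

With these made-up numbers, the search keeps the most sensitive layer at 8 bits and lowers the other two to 4 bits to stay within the budget. For realistic layer counts the same objective and constraint would be handed to an integer or linear programming solver rather than enumerated.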

Mixed-precision Deep Neural Network Quantization With Multiple Compression Rates

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Propagating Asymptotic-Estimated Gradients for Low Bitwidth Quantized Neural Networks

Deep Neural Network Compression With Single and Multiple Level Quantization

DPQ: dynamic pseudo-mean mixed-precision quantization for pruned neural network

Joint Optimization of Dimension Reduction and Mixed-Precision Quantization for Activation Compression of Neural Networks

Mixed-Precision Neural Network Quantization Via Learned Layer-Wise Importance

Instance-Aware Dynamic Neural Network Quantization

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

Adaptive Layerwise Quantization for Deep Neural Network Compression

One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment

Distribution-aware Adaptive Multi-bit Quantization

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Residual Quantization for Low Bit-Width Neural Networks

Quantization Networks

Efficient Bitwidth Search for Practical Mixed Precision Neural Network

Structured Dynamic Precision for Deep Neural Networks Quantization

Direct Quantization for Training Highly Accurate Low Bit-width Deep Neural Networks

Joint Pruning and Channel-Wise Mixed-Precision Quantization for Efficient Deep Neural Networks