Abstract: Low-bit quantization has emerged as one of the most promising compression approaches for deploying deep neural networks on edge devices. Mixed-precision quantization uses a mixture of bit-widths to unlock the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization methods rely on simulations on high-performance devices to achieve accuracy and efficiency trade-offs in immense search spaces. This leads to a non-negligible gap between the estimated efficiency metrics and the behavior of the actual hardware, which leaves quantized models far from optimal in accuracy and efficiency, and it also makes the quantization process depend on additional high-performance devices. In this paper, we propose an On-Chip Hardware-Aware Quantization (OHQ) framework, which performs hardware-aware mixed-precision quantization on deployed edge devices to achieve accurate and efficient computing. Specifically, for efficiency metrics, we build an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of each quantized operator and avoids optimization errors caused by inaccurate simulation. For accuracy metrics, we propose a Mask-Guided Quantization Estimation technique to effectively estimate the accuracy impact of operators in the on-chip scenario, removing the dependence of the quantization process on high computing power. By synthesizing insights from quantized models and hardware through linear optimization, we obtain optimized bit-width configurations that perform well on both accuracy and efficiency. We evaluate inference accuracy and acceleration with quantization for various architectures and compression ratios on hardware. OHQ achieves 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively, and can reduce latency by 15-30% compared to INT8 in real deployment.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address the challenges of low-bit quantization and mixed-precision quantization when deploying deep neural networks (DNNs) on edge devices. Specifically, existing mixed-precision quantization methods rely on simulations on high-performance devices to achieve a balance between accuracy and efficiency, which leads to a significant gap between the estimated efficiency metrics and the actual hardware, making the quantized models far from optimal in terms of accuracy and efficiency. Moreover, this quantization process also depends on additional high-performance devices.

To solve these problems, the authors propose a framework called "On-Chip Hardware-Aware Quantization" (OHQ). The OHQ framework enables hardware-aware mixed-precision quantization on deployed edge devices, achieving accurate and efficient computation. Specifically, OHQ achieves this goal through the following two main techniques:

1. **On-Chip Quantization Awareness (OQA)**:
   - Constructs an on-chip quantization awareness pipeline, allowing the quantization process to be aware of the actual hardware efficiency metrics, avoiding optimization errors caused by inaccurate simulations.
2. **Mask-Guided Quantization Estimation (MQE)**:
   - Proposes a mask-guided quantization estimation technique that effectively estimates the accuracy impact of operators in on-chip scenarios, eliminating the dependence on high computational power.

By integrating insights from quantized models and hardware through linear optimization, OHQ can obtain optimized bit-width configurations, achieving excellent performance in terms of accuracy and efficiency. The entire quantization process is conducted completely on-chip, without the need for additional devices and data access.
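
The bit-width selection step can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes each layer has a per-bit-width accuracy penalty and a measured latency (all numbers here are invented), and it picks the configuration with the lowest total penalty under a latency budget. For simplicity it enumerates every configuration exhaustively, standing in for the linear-programming solver a real pipeline would use; the names `select_bit_widths`, `sensitivity`, `latency`, and `budget_ms` are hypothetical.

```python
from itertools import product

# Hypothetical per-layer numbers, invented purely for illustration.
# sensitivity[i][b]: estimated accuracy penalty of running layer i at b bits.
# latency[i][b]: measured on-device latency (ms) of layer i at b bits.
BITS = (4, 8)
sensitivity = [
    {4: 0.9, 8: 0.1},
    {4: 0.3, 8: 0.05},
    {4: 0.2, 8: 0.02},
]
latency = [
    {4: 1.0, 8: 1.6},
    {4: 2.0, 8: 3.5},
    {4: 1.5, 8: 2.4},
]


def select_bit_widths(sensitivity, latency, budget_ms, bits=BITS):
    """Return (config, cost) minimizing total penalty within the latency budget.

    Exhaustive search over all per-layer bit-width choices; only practical
    for tiny examples. Returns (None, inf) if no configuration fits.
    """
    best_cfg, best_cost = None, float("inf")
    for cfg in product(bits, repeat=len(sensitivity)):
        total_latency = sum(latency[i][b] for i, b in enumerate(cfg))
        if total_latency > budget_ms:
            continue  # violates the latency constraint
        cost = sum(sensitivity[i][b] for i, b in enumerate(cfg))
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


config, cost = select_bit_widths(sensitivity, latency, budget_ms=5.5)
print(config, round(cost, 3))
```

With these made-up numbers the search keeps the most sensitive first layer at 8 bits and drops the other two to 4 bits, which is the kind of trade-off a mixed-precision search is meant to find. The search space grows exponentially with the number of layers, which is why a linear-optimization formulation is preferable in practice.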

### Experimental Results

Experimental results show that OHQ outperforms existing mixed-precision quantization methods in terms of inference accuracy and acceleration performance across various architectures and compression ratios. For example, on ResNet-18 and MobileNetV3, OHQ achieves 70% and 73% accuracy, respectively, and reduces latency by 15% to 30% compared to INT8 quantization. These results demonstrate the significant advantages of OHQ in practical applications.

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

Hardware-Centric AutoML for Mixed-Precision Quantization

Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip

Edge-MPQ: Layer-Wise Mixed-Precision Quantization With Tightly Integrated Versatile Inference Units for Edge Computing

Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge

AMED: Automatic Mixed-Precision Quantization for Edge Devices

A Fine-Grained Mixed Precision DNN Accelerator Using a Two-Stage Big-Little Core RISC-V MCU

HAWQ-V3: Dyadic Neural Network Quantization

Exploiting Retraining-Based Mixed-Precision Quantization for Low-Cost DNN Accelerator Design

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

Adaptive quantization with mixed-precision based on low-cost proxy

LSFQ: A Low Precision Full Integer Quantization for High-Performance FPGA-Based CNN Acceleration

Joint Pruning and Channel-Wise Mixed-Precision Quantization for Efficient Deep Neural Networks

Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers

Hardware-friendly Deep Learning by Network Quantization and Binarization

Mixed-Precision Neural Network Quantization Via Learned Layer-Wise Importance