Abstract: Low-bit quantization has emerged as one of the most promising compression approaches for deploying deep neural networks on edge devices. Mixed-precision quantization uses a mixture of bit-widths to unlock the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization methods rely on simulation on high-performance devices to find accuracy and efficiency trade-offs in immense search spaces. This leaves a non-negligible gap between the estimated efficiency metrics and the behavior of the actual hardware, which keeps quantized models far from the optimal accuracy and efficiency, and it also makes the quantization process depend on additional high-performance devices. In this paper, we propose an On-Chip Hardware-Aware Quantization (OHQ) framework that performs hardware-aware mixed-precision quantization on deployed edge devices to achieve accurate and efficient computing. Specifically, for efficiency metrics, we build an On-Chip Quantization Aware pipeline, which lets the quantization process perceive the actual hardware efficiency of each quantized operator and avoids optimization errors caused by inaccurate simulation. For accuracy metrics, we propose a Mask-Guided Quantization Estimation technique that effectively estimates the accuracy impact of operators in the on-chip scenario, removing the quantization process's dependence on high computing power. By combining insights from the quantized model and the hardware through linear optimization, we obtain optimized bit-width configurations that perform well on both accuracy and efficiency. We evaluate inference accuracy and acceleration under quantization for various architectures and compression ratios on hardware. OHQ achieves 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively, and can reduce latency by 15% to 30% compared to INT8 in real deployment.
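
The bit-width selection step mentioned in the abstract can be illustrated with a toy sketch. This is not the OHQ implementation: the function name `choose_bitwidths`, the per-layer `sensitivity` and `latency` tables, and all numbers are hypothetical, and exhaustive enumeration stands in for a real linear-programming solver so that the example stays dependency-free. It only shows the general shape of the problem: a linear objective (summed accuracy impact) minimized under a linear constraint (summed on-chip latency).

```python
from itertools import product


def choose_bitwidths(sensitivity, latency, budget):
    """Pick one bit-width per layer, minimizing total sensitivity under a latency budget.

    sensitivity[i][b]: hypothetical accuracy-impact score of layer i at bit-width b
    latency[i][b]:     hypothetical on-chip latency of layer i at bit-width b
    budget:            maximum total latency allowed
    Returns (config, cost), or (None, inf) if no configuration fits the budget.
    """
    bit_options = [sorted(layer.keys()) for layer in sensitivity]
    best_cfg, best_cost = None, float("inf")
    # Enumerate every per-layer assignment; fine for a toy model, but a real
    # network would need an ILP solver because the space grows exponentially.
    for cfg in product(*bit_options):
        total_latency = sum(latency[i][b] for i, b in enumerate(cfg))
        if total_latency > budget:
            continue  # violates the linear latency constraint
        cost = sum(sensitivity[i][b] for i, b in enumerate(cfg))
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


# Toy numbers invented for illustration only (three layers, 4-bit or 8-bit).
sensitivity = [
    {4: 0.9, 8: 0.1},
    {4: 0.2, 8: 0.05},
    {4: 0.3, 8: 0.1},
]
latency = [
    {4: 1.0, 8: 2.0},
    {4: 1.5, 8: 3.0},
    {4: 1.0, 8: 2.0},
]

config, cost = choose_bitwidths(sensitivity, latency, budget=5.5)
print(config, cost)
```

With these made-up tables, the least sensitive layer (the second one) is pushed to 4 bits while the others stay at 8 bits, because that is the cheapest way to meet the latency budget. In an on-chip setting, the latency table would come from measurements on the target device rather than from a simulator.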

Phoenix: A Low-Precision Floating-Point Quantization Oriented Architecture for Convolutional Neural Networks

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

A Reconfigurable Approximate Multiplier for Quantized CNN Applications

Single-shot Pruning and Quantization for Hardware-Friendly Neural Network Acceleration

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

LSFQ: A Low Precision Full Integer Quantization for High-Performance FPGA-Based CNN Acceleration

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU

Low Precision Floating Point Arithmetic for High Performance FPGA-based CNN Acceleration

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

A high-speed reusable quantized hardware accelerator design for CNN on constrained edge device

Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks

Optimal Architecture of Floating-Point Arithmetic for Neural Network Training Processors

FPGA-Based Hybrid-Type Implementation of Quantized Neural Networks for Remote Sensing Applications

High-Performance FPGA-Based CNN Accelerator with Block-Floating-Point Arithmetic

A Hardware-Friendly Low-Bit Power-of-Two Quantization Method for CNNs and Its FPGA Implementation

Custom Network Quantization Method for Lightweight CNN Acceleration on FPGAs

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

A Block-Floating-Point Arithmetic Based FPGA Accelerator for Convolutional Neural Networks

GroupQ: Group-Wise Quantization With Multi-Objective Optimization for CNN Accelerators

A hardware-friendly logarithmic quantization method for CNNs and FPGA implementation