Abstract: As one of the prevailing deep neural network compression techniques, layer-wise mixed-precision quantization (MPQ) strikes a better balance between accuracy and efficiency than uniform quantization schemes. However, existing MPQ strategies either lack hardware awareness or incur huge computation costs, limiting their deployment at the edge. Additionally, researchers usually make a one-time decision between post-training quantization (PTQ) and quantization-aware training (QAT) based on the quantized bit-width or hardware requirements. In this paper, we propose the tight integration of versatile MPQ inference units, which support INT2-INT8 and INT16 precisions and feature a hierarchical multiplier architecture, into a RISC-V processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Synthesized in a 14 nm technology, the design delivers a speedup of 15.50× to 47.67× over the baseline RV64IMA core when running a single convolution layer kernel and achieves up to 2.86 GOPS performance. This work also achieves an energy efficiency of 20.51 TOPS/W, which exceeds contemporary state-of-the-art MPQ hardware solutions at the edge. We also propose a novel MPQ search algorithm that incorporates both hardware awareness and training necessity. The algorithm samples layer-wise sensitivities using a set of newly proposed metrics and runs a heuristic search. Evaluation results show that this search algorithm achieves 2.2% to 6.7% higher inference accuracy under similar hardware constraints compared to state-of-the-art MPQ strategies. Furthermore, we expand the search space using a dynamic programming (DP) strategy to search with more fine-grained accuracy intervals and to support multi-dimensional search. This further improves the inference accuracy by over 1.3% compared to a greedy-based search.
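To make the idea of a DP-based layer-wise bit-width search concrete, the following is a minimal sketch, not the algorithm proposed in this work. It assumes each layer has a hypothetical per-bit-width sensitivity score (a proxy for accuracy loss) and a hypothetical integer hardware cost, and it selects one bit-width per layer so that the summed sensitivity is minimal while the total cost stays within a budget. The function name `dp_bitwidth_search`, the metrics, and all numbers are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def dp_bitwidth_search(
    sensitivity: List[Dict[int, float]],
    cost: List[Dict[int, int]],
    budget: int,
) -> Tuple[float, List[int]]:
    """Choose one bit-width per layer (multiple-choice knapsack via DP).

    sensitivity[l][b]: assumed accuracy-loss proxy of layer l at b bits.
    cost[l][b]:        assumed integer hardware cost of layer l at b bits.
    Returns (total sensitivity, chosen bit-widths) with total cost <= budget.
    """
    # best[c] = (lowest total sensitivity, bit-widths) reaching exact cost c
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for sens_l, cost_l in zip(sensitivity, cost):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (loss, bits) in best.items():
            for b, s in sens_l.items():
                c = used + cost_l[b]
                if c > budget:
                    continue  # this choice would violate the hardware budget
                cand = loss + s
                if c not in nxt or cand < nxt[c][0]:
                    nxt[c] = (cand, bits + [b])
        best = nxt
    if not best:
        raise ValueError("no bit-width assignment fits the budget")
    return min(best.values(), key=lambda entry: entry[0])


# Illustrative (made-up) numbers for a three-layer network.
sensitivity = [
    {2: 0.9, 4: 0.3, 8: 0.0},
    {2: 0.2, 4: 0.1, 8: 0.0},
    {2: 0.5, 4: 0.1, 8: 0.0},
]
cost = [
    {2: 2, 4: 4, 8: 8},
    {2: 4, 4: 8, 8: 16},
    {2: 1, 4: 2, 8: 4},
]
loss, bits = dp_bitwidth_search(sensitivity, cost, budget=16)
print(loss, bits)
```

With these made-up inputs, the search keeps the first and third layers at 8 bits and quantizes the second, least sensitive layer to 2 bits. Unlike a greedy pass that fixes one layer at a time, the DP table keeps the best assignment for every reachable cost, so a locally poor choice can still be part of the final optimum.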

Edge-MPQ: Layer-Wise Mixed-Precision Quantization With Tightly Integrated Versatile Inference Units for Edge Computing
