Abstract: As one of the prevailing deep neural network compression techniques, layer-wise mixed-precision quantization (MPQ) strikes a better balance between accuracy and efficiency than uniform quantization schemes. However, existing MPQ strategies either lack hardware awareness or incur huge computation costs, limiting their deployment at the edge. Additionally, researchers usually make a one-time decision between post-training quantization (PTQ) and quantization-aware training (QAT) based on the quantized bit-width or hardware requirements. In this paper, we propose the tight integration of versatile MPQ inference units supporting INT2-INT8 and INT16 precisions, which feature a hierarchical multiplier architecture, into a RISC-V processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Synthesized in a 14 nm technology, the design delivers a speedup of 15.50× to 47.67× over the baseline RV64IMA core when running a single convolution layer kernel and achieves up to 2.86 GOPS performance. This work also achieves an energy efficiency of 20.51 TOPS/W, exceeding contemporary state-of-the-art MPQ hardware solutions at the edge. We also propose a novel MPQ search algorithm that incorporates both hardware awareness and training necessity. The algorithm samples layer-wise sensitivities using a set of newly proposed metrics and runs a heuristic search. Evaluation results show that this search algorithm achieves 2.2% to 6.7% higher inference accuracy under similar hardware constraints compared to state-of-the-art MPQ strategies. Furthermore, we expand the search space using a dynamic programming (DP) strategy to search with more fine-grained accuracy intervals and to support multi-dimensional search. This further improves the inference accuracy by over 1.3% compared to a greedy-based search.
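To make the idea of a DP-based layer-wise bit-width search concrete, the following is a minimal sketch, not the algorithm from the paper. It treats the problem as a multiple-choice knapsack: each layer picks one bit-width, and the search minimizes a summed sensitivity score under an integer hardware budget. The function name `dp_bitwidth_search`, the per-layer `sens` and `cost` tables, and the additive cost model are all assumptions made for illustration; the paper's own sensitivity metrics and hardware constraints are not specified in the abstract.

```python
from typing import Dict, List, Optional, Tuple


def dp_bitwidth_search(
    sens: List[Dict[int, float]],
    cost: List[Dict[int, int]],
    budget: int,
) -> Optional[Tuple[float, List[int]]]:
    """Pick one bit-width per layer, minimizing total sensitivity under a budget.

    sens[l][b]: hypothetical sensitivity of layer l at bit-width b (lower is better).
    cost[l][b]: hypothetical integer hardware cost of layer l at bit-width b.
    budget:     maximum allowed total cost.
    Returns (total sensitivity, bit-width per layer), or None if no assignment fits.
    """
    # best[c] holds the lowest-sensitivity assignment whose total cost is exactly c.
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for layer_sens, layer_cost in zip(sens, cost):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (total, picks) in best.items():
            for bits, delta in layer_sens.items():
                c = used + layer_cost[bits]
                if c > budget:
                    continue  # this choice would exceed the hardware budget
                cand = (total + delta, picks + [bits])
                if c not in nxt or cand[0] < nxt[c][0]:
                    nxt[c] = cand
        best = nxt
    if not best:
        return None
    # Any total cost within the budget is acceptable; take the least sensitive one.
    return min(best.values(), key=lambda entry: entry[0])


if __name__ == "__main__":
    # Two layers, each choosing between INT4 and INT8 (illustrative numbers only).
    sens = [{4: 0.5, 8: 0.1}, {4: 0.2, 8: 0.05}]
    cost = [{4: 4, 8: 8}, {4: 4, 8: 8}]
    print(dp_bitwidth_search(sens, cost, budget=12))
```

Unlike a greedy pass that fixes one layer at a time, the table keeps the best assignment for every reachable cost, so a locally worse choice in an early layer can still be part of the final answer. A multi-dimensional variant would key the table on a tuple of costs (for example latency and memory) instead of a single integer.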

MCU-MixQ: A HW/SW Co-optimized Mixed-precision Neural Network Design Framework for MCUs

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

DeepBurning-MixQ: An Open Source Mixed-Precision Neural Network Accelerator Design Framework for FPGAs

MixMixQ: Quantization with Mixed Bit-Sparsity and Mixed Bit-Width for CIM Accelerators

CMix-NN: Mixed Low-Precision CNN Library for Memory-Constrained Edge Devices

A Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference

Edge-MPQ: Layer-Wise Mixed-Precision Quantization With Tightly Integrated Versatile Inference Units for Edge Computing

CIM2PQ: an Array-Wise and Hardware-Friendly Mixed Precision Quantization Method for Analog Computing-In-Memory

Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers

MSD: Mixing Signed Digit Representations for Hardware-efficient DNN Acceleration on FPGA with Heterogeneous Resources

Value-Driven Mixed-Precision Quantization for Patch-Based Inference on Microcontrollers

Enabling Mixed-Precision Quantized Neural Networks in Extreme-Edge Devices

CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory Based Neural Network Accelerators

GroupQ: Group-Wise Quantization With Multi-Objective Optimization for CNN Accelerators

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations

ES-MPQ: Evolutionary Search enabled Mixed Precision Quantization Framework for Computing-In-Memory

An Energy Efficient Soft SIMD Microarchitecture and Its Application on Quantized CNNs