Abstract: As one of the prevailing deep neural network compression techniques, layer-wise mixed-precision quantization (MPQ) strikes a better balance between accuracy and efficiency than uniform quantization schemes. However, existing MPQ strategies either lack hardware awareness or incur high computation costs, which limits their deployment at the edge. Additionally, researchers usually make a one-time decision between post-training quantization (PTQ) and quantization-aware training (QAT) based on the quantized bit-width or hardware requirements. In this paper, we propose the tight integration of versatile MPQ inference units, which support INT2-INT8 and INT16 precisions and feature a hierarchical multiplier architecture, into a RISC-V processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Synthesized in a 14 nm technology, the design delivers a speedup of 15.50× to 47.67× over the baseline RV64IMA core when running a single convolution layer kernel and achieves up to 2.86 GOPS. It also achieves an energy efficiency of 20.51 TOPS/W, which exceeds contemporary state-of-the-art MPQ hardware solutions at the edge. We also propose a novel MPQ search algorithm that incorporates both hardware awareness and training necessity. The algorithm samples layer-wise sensitivities using a set of newly proposed metrics and runs a heuristic search. Evaluation results show that this search algorithm achieves 2.2% to 6.7% higher inference accuracy under similar hardware constraints compared to state-of-the-art MPQ strategies. Furthermore, we expand the search space using a dynamic programming (DP) strategy to search with more fine-grained accuracy intervals and to support multi-dimensional search. This further improves the inference accuracy by over 1.3% compared to a greedy-based search.
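To make the idea of a DP-based bit-width search concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes each layer has a made-up sensitivity score per candidate bit-width (lower is better) and an integer hardware cost per bit-width, and it treats the problem as a multiple-choice knapsack: pick exactly one bit-width per layer so that total sensitivity is minimal while total cost stays within a budget. The function name `dp_bitwidth_search`, the cost model, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def dp_bitwidth_search(
    sensitivity: List[Dict[int, float]],
    cost: List[Dict[int, int]],
    budget: int,
) -> Tuple[float, List[int]]:
    """Choose one bit-width per layer, minimising summed sensitivity
    subject to summed cost <= budget (multiple-choice knapsack DP)."""
    # best[c] = (lowest total sensitivity, chosen bit-widths) among all
    # assignments of the layers seen so far whose total cost is exactly c.
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for layer_sens, layer_cost in zip(sensitivity, cost):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (loss, picks) in best.items():
            for bits, sens in layer_sens.items():
                total = used + layer_cost[bits]
                if total > budget:
                    continue  # this choice would exceed the hardware budget
                cand = (loss + sens, picks + [bits])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        raise ValueError("no bit-width assignment fits within the budget")
    return min(best.values(), key=lambda entry: entry[0])


# Toy 3-layer example; every value below is invented for illustration.
sensitivity = [
    {2: 0.9, 4: 0.2, 8: 0.0},
    {2: 0.5, 4: 0.1, 8: 0.0},
    {2: 2.0, 4: 0.6, 8: 0.0},
]
cost = [
    {2: 8, 4: 16, 8: 32},
    {2: 4, 4: 8, 8: 16},
    {2: 2, 4: 4, 8: 8},
]
loss, picks = dp_bitwidth_search(sensitivity, cost, budget=32)
print(picks)  # [4, 4, 8]
```

In this toy setting the search keeps the most sensitive (third) layer at 8 bits and lowers the other two to 4 bits. A real search would replace the invented sensitivity and cost tables with measured layer-wise metrics and a hardware model, and the state could be extended with further dimensions (for example latency and energy) at the price of a larger table.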

MixMixQ: Quantization with Mixed Bit-Sparsity and Mixed Bit-Width for CIM Accelerators

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory Based Neural Network Accelerators

A Reconfigurable Approximate Multiplier for Quantized CNN Applications

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

A Reconfigurable Multiplier for Signed Multiplications with Asymmetric Bit-Widths

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

MCU-MixQ: A HW/SW Co-optimized Mixed-precision Neural Network Design Framework for MCUs

MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU

Scale-CIM: Precision-Scalable Computing-in-Memory for Energy-Efficient Quantized Neural Networks

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

PIM-QAT: Neural Network Quantization for Processing-In-Memory (PIM) Systems

Exploiting Retraining-Based Mixed-Precision Quantization for Low-Cost DNN Accelerator Design

GroupQ: Group-Wise Quantization With Multi-Objective Optimization for CNN Accelerators

ZEBRA: A Zero-Bit Robust-Accumulation Compute-In-Memory Approach for Neural Network Acceleration Utilizing Different Bitwise Patterns

Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators

Edge-MPQ: Layer-Wise Mixed-Precision Quantization With Tightly Integrated Versatile Inference Units for Edge Computing

CSMPQ: Class Separability Based Mixed-Precision Quantization

Hardware-Centric AutoML for Mixed-Precision Quantization

LSMQ: A Layer-Wise Sensitivity-Based Mixed-Precision Quantization Method for Bit-Flexible CNN Accelerator