Abstract: CNN model computation on edge devices is tightly constrained by limited resource and power budgets, which motivates low-bit quantization techniques that compress CNN models into 4-bit or lower formats to reduce model size and increase hardware efficiency. Most current low-bit quantization methods use uniform quantization, which maps weight and activation values onto evenly distributed levels and usually incurs accuracy loss due to distribution mismatch. Meanwhile, some non-uniform quantization methods propose specialized representations that better match various distribution shapes but are usually difficult to accelerate efficiently in hardware. To achieve low-bit quantization with both high accuracy and hardware efficiency, this paper proposes Universal Power-of-Two (UPoT), a novel low-bit quantization method that represents values as the sum of multiple power-of-two values selected from a series of subsets. By updating the subset contents, UPoT can provide adaptive quantization levels for various distributions. For each CNN model layer, UPoT automatically searches for the optimized distribution that minimizes the quantization error. Moreover, we design an efficient accelerator system with specifically optimized power-of-two multipliers and requantization units. Evaluations show that the proposed architecture provides high-performance CNN inference with reduced circuit area and energy, and outperforms several mainstream CNN accelerators with $8\times$–$65\times$ higher area efficiency and $2\times$–$19\times$ higher energy efficiency. Further experiments with 4/3/2-bit quantization on ResNet18/50, MobileNet_V2 and EfficientNet models show that UPoT achieves high model accuracy, outperforming other state-of-the-art low-bit quantization methods by 0.3%–6%. The results indicate that our approach provides a highly efficient accelerator for low-bit CNN model quantization with low hardware overhead and good model accuracy.
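
The idea of building quantization levels from sums of power-of-two terms can be illustrated with a minimal sketch. It is not the UPoT algorithm from the paper: the subset contents, the helper names (`pot_sum_levels`, `quantize`, `mse`), the restriction to non-negative values, and the brute-force choice between two candidate subset configurations are all assumptions made for illustration only.

```python
from itertools import product

def pot_sum_levels(subsets):
    """Return every distinct sum formed by picking one term from each subset."""
    return sorted({sum(choice) for choice in product(*subsets)})

def quantize(x, levels):
    """Map x to the nearest level; ties resolve to the smaller level."""
    return min(levels, key=lambda q: (abs(x - q), q))

def mse(values, levels):
    """Mean squared quantization error of values against a level set."""
    return sum((v - quantize(v, levels)) ** 2 for v in values) / len(values)

# Hypothetical subsets: each holds zero plus a few powers of two.
# Two 2-bit selectors (4 entries each) give a 4-bit code with 16 levels.
subsets_a = [[0.0, 2**-1, 2**-2, 2**-3], [0.0, 2**-4, 2**-5, 2**-6]]
subsets_b = [[0.0, 2**-1, 2**-3, 2**-5], [0.0, 2**-2, 2**-4, 2**-6]]

# Toy stand-in for one layer's (non-negative) weight magnitudes.
weights = [0.02, 0.07, 0.11, 0.3, 0.45, 0.6]

# Pick the candidate configuration with the lower quantization error,
# a stand-in for a per-layer search over subset contents.
best = min((subsets_a, subsets_b),
           key=lambda s: mse(weights, pot_sum_levels(s)))
levels = pot_sum_levels(best)
print(len(levels), [quantize(w, levels) for w in weights])
```

Changing which powers of two sit in each subset moves the level set toward or away from zero, which is the sense in which the levels adapt to a distribution. Because each level is a sum of power-of-two terms, a multiplication by it can in principle be carried out with shifts and adds.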

A Fully Quantitative Scheme With Fine-grained Tuning Method For Lightweight CNN Acceleration

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

Optimizing Quantized Neural Networks in a Weak Curvature Manifold

Custom Network Quantization Method for Lightweight CNN Acceleration on FPGAs

LSFQ: A Low Precision Full Integer Quantization for High-Performance FPGA-Based CNN Acceleration

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

Improving Neural Network Efficiency Via Post-training Quantization with Adaptive Floating-Point

An Energy-and-Area-Efficient CNN Accelerator for Universal Powers-of-Two Quantization

Low-precision CNN Model Quantization based on Optimal Scaling Factor Estimation

Fixed-point Quantization of Convolutional Neural Networks for Quantized Inference on Embedded Platforms

AQA: an Adaptive Post-Training Quantization Method for Activations of CNNs

KCNN: Kernel-wise Quantization to Remarkably Decrease Multiplications in Convolutional Neural Network

DQI: A Dynamic Quantization Method for Efficient Convolutional Neural Network Inference Accelerators

Integer-Only CNNs with 4 Bit Weights and Bit-Shift Quantization Scales at Full-Precision Accuracy

FQ-Conv: Fully Quantized Convolution for Efficient and Accurate Inference

Designing Quantizers for Low-Precision Post-Training Quantization: A Standard Pipeline Approach for CNNs

Focused Quantization for Sparse CNNs

Fully Integer-Based Quantization for Mobile Convolutional Neural Network Inference

A Closer Look at Hardware-Friendly Weight Quantization