Abstract: CNN model computation on edge devices is tightly constrained by limited resource and power budgets, which motivates low-bit quantization techniques that compress CNN models into 4-bit or lower formats to reduce model size and increase hardware efficiency. Most current low-bit quantization methods use uniform quantization, which maps weight and activation values onto evenly distributed levels and usually incurs accuracy loss due to distribution mismatch. Meanwhile, some non-uniform quantization methods propose specialized representations that better match various distribution shapes but are usually difficult to accelerate efficiently in hardware. To achieve low-bit quantization with both high accuracy and hardware efficiency, this paper proposes Universal Power-of-Two (UPoT), a novel low-bit quantization method that represents values as the sum of multiple power-of-two values selected from a series of subsets. By updating the subset contents, UPoT can provide adaptive quantization levels for various distributions. For each CNN model layer, UPoT automatically searches for the optimized distribution that minimizes the quantization error. Moreover, we design an efficient accelerator system with specifically optimized power-of-two multipliers and requantization units. Evaluations show that the proposed architecture provides high-performance CNN inference with reduced circuit area and energy, and outperforms several mainstream CNN accelerators with $8\times$–$65\times$ higher area efficiency and $2\times$–$19\times$ higher energy efficiency. Further experiments with 4/3/2-bit quantization on ResNet18/50, MobileNet_V2 and EfficientNet models show that UPoT achieves high model accuracy, outperforming other state-of-the-art low-bit quantization methods by 0.3%–6%. The results indicate that our approach provides a highly efficient accelerator for low-bit CNN model quantization with low hardware overhead and good model accuracy.
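The idea of building quantization levels from sums of power-of-two terms, and of choosing the subsets that best fit a layer's values, can be illustrated with a toy sketch. The abstract does not give the actual subset structure, bit allocation, or search procedure, so everything below is an assumption made for illustration: the function names `upot_levels`, `quantize`, `mse`, and `select_best` are hypothetical, each subset holds only two entries, and the search is a plain exhaustive comparison over hand-picked candidates.

```python
from itertools import product


def upot_levels(subsets):
    """Return every level formed by adding one term from each subset (assumed scheme)."""
    return sorted({sum(combo) for combo in product(*subsets)})


def quantize(x, levels):
    """Map x to the nearest available quantization level."""
    return min(levels, key=lambda q: abs(q - x))


def mse(values, levels):
    """Mean squared quantization error of values under the given levels."""
    return sum((v - quantize(v, levels)) ** 2 for v in values) / len(values)


def select_best(values, candidates):
    """Pick the candidate subset configuration with the lowest quantization error."""
    return min(candidates, key=lambda subsets: mse(values, upot_levels(subsets)))


# Two hypothetical 2-bit configurations: one term is chosen from each subset.
subsets_a = ((0.0, 2**-1), (0.0, 2**-2))  # levels 0, 0.25, 0.5, 0.75 (uniform)
subsets_b = ((0.0, 2**-1), (0.0, 2**-3))  # levels 0, 0.125, 0.5, 0.625 (non-uniform)

# Made-up sample values concentrated near zero, standing in for one layer's weights.
layer_values = [0.02, 0.1, 0.12, 0.15, 0.5, 0.6]

best = select_best(layer_values, [subsets_a, subsets_b])
print(upot_levels(best))
```

In this sketch, changing the contents of a subset shifts where the levels fall, so a configuration with denser levels near zero fits the sample values better than the uniform one. Because every level is a sum of powers of two, a multiplication by a level can in principle be carried out with shifts and additions, which is the property a power-of-two multiplier exploits.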

An Energy-and-Area-Efficient CNN Accelerator for Universal Powers-of-Two Quantization.
