Abstract: Owing to their low memory requirements, low computation overhead, and negligible accuracy degradation, deep neural networks with binary/ternary weights (BTNNs) have been widely deployed on low-power mobile and Internet of Things (IoT) devices with limited storage capacity. Several hardware implementations have been proposed to accelerate BTNN inference by exploiting its multiplication-free nature. However, some implicit characteristics of BTNN convolution, such as high arithmetic complexity and numerous redundant operations, have not been considered. In this paper, we propose four optimization techniques to fully exploit these characteristics. First, a feature-integral-based convolution (FIBC) method is proposed to reduce the arithmetic complexity of convolutional layers. Second, a kernel-transformation-feature-reconstruction (KTFR) convolution method is presented to remove redundant operations in BTNN convolution. Third, a hierarchical load-balancing mechanism (HLBM) is designed to eliminate zero-value computation and improve resource utilization. Finally, a joint optimization approach for convolutional layers is proposed to search for the optimal calculation pattern for each layer. Based on the four proposed techniques, we design a reconfigurable processor in a 28-nm CMOS technology to accelerate BTNN inference. 
The four proposed techniques improve energy efficiency by 2.07×, 1.65×, 1.25×, and 2.24× for BTNNs, respectively, compared with the baseline implementation, which disables the proposed techniques. Benchmarked with binary-weight AlexNet, the processor achieves an energy efficiency of 19.9 TOPS/W at 200 MHz and 0.9 V.
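
The multiplication-free property that BTNN accelerators build on can be illustrated with a minimal software sketch. It is not the proposed processor, nor the FIBC, KTFR, or HLBM techniques, whose details are not given in the abstract. It only shows that, with weights restricted to -1, 0, and +1, each multiply-accumulate reduces to an add, a subtract, or a skip. The function names `mult_free_dot` and `conv2d_valid` are hypothetical and chosen for illustration.

```python
def mult_free_dot(activations, weights):
    """Dot product with weights in {-1, 0, +1}, using only add and subtract."""
    acc = 0
    for a, w in zip(activations, weights):
        if w == 1:
            acc += a          # +1 weight: accumulate the activation
        elif w == -1:
            acc -= a          # -1 weight: subtract the activation
        elif w != 0:
            raise ValueError("weights must be -1, 0, or +1")
        # w == 0 contributes nothing, so no operation is performed
    return acc


def conv2d_valid(feature, kernel):
    """Single-channel 'valid' 2-D convolution (cross-correlation form)
    built on mult_free_dot. Assumes rectangular lists of lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature) - kh + 1
    out_w = len(feature[0]) - kw + 1
    flat_kernel = [w for row in kernel for w in row]
    output = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            # Flatten the window in the same row-major order as the kernel
            window = [feature[i + di][j + dj]
                      for di in range(kh) for dj in range(kw)]
            out_row.append(mult_free_dot(window, flat_kernel))
        output.append(out_row)
    return output


if __name__ == "__main__":
    feature = [[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]]
    ternary_kernel = [[1, -1],
                      [0, 1]]
    print(conv2d_valid(feature, ternary_kernel))
```

In this sketch, zero weights are skipped and cost nothing. That is the kind of zero-value computation a hardware load-balancing scheme would aim to eliminate, although how HLBM does this is not described here.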
