Abstract: Owing to their lower memory requirements, low computation overhead, and negligible accuracy degradation, deep neural networks with binary/ternary weights (BTNNs) have been widely deployed on low-power mobile and Internet of Things (IoT) devices with limited storage capacity. Several hardware implementations have been proposed to accelerate BTNN inference by exploiting its multiplication-free nature. However, some implicit characteristics of BTNN convolution, such as high arithmetic complexity and numerous redundant operations, have not been considered. In this paper, we propose four optimization techniques to fully exploit these characteristics. First, a feature-integral-based convolution (FIBC) method is proposed to reduce the arithmetic complexity of convolutional layers. Second, a kernel-transformation-feature-reconstruction (KTFR) convolution method is presented to remove redundant operations in BTNN convolution. Third, a hierarchical load-balancing mechanism (HLBM) is designed to eliminate zero-value computation and improve resource utilization. Finally, a joint optimization approach for convolutional layers is proposed to search for the optimal calculation pattern for each layer. Based on the four proposed techniques, we design a reconfigurable processor in a 28-nm CMOS technology to accelerate BTNN inference.
The four proposed techniques improve energy efficiency by 2.07$\times$, 1.65$\times$, 1.25$\times$, and 2.24$\times$ for BTNNs, respectively, compared with the baseline implementation which disables the proposed techniques. Benchmarked with binary-weight AlexNet, the processor achieves an energy efficiency of 19.9 TOPS/W at 200 MHz and 0.9 V.
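
To make the multiplication-free property concrete, the following is a minimal sketch of a 2-D convolution (valid cross-correlation) whose weights are restricted to {-1, 0, +1}. It uses only additions and subtractions and skips zero weights. It is an illustration of the general property only, not the FIBC, KTFR, or HLBM methods of the paper; the function name `ternary_conv2d` and the example data are hypothetical.

```python
import numpy as np


def ternary_conv2d(feature, kernel):
    """Valid 2-D cross-correlation with weights in {-1, 0, +1}.

    Hypothetical helper for illustration: every weight either adds a
    shifted window of the feature map, subtracts it, or is skipped,
    so no multiplications are performed.
    """
    feature = np.asarray(feature)
    kernel = np.asarray(kernel)
    if not np.isin(kernel, (-1, 0, 1)).all():
        raise ValueError("kernel weights must be -1, 0, or +1")

    kh, kw = kernel.shape
    oh = feature.shape[0] - kh + 1
    ow = feature.shape[1] - kw + 1
    # Signed accumulator wide enough for the input type.
    out = np.zeros((oh, ow), dtype=np.result_type(feature.dtype, np.int64))

    for i in range(kh):
        for j in range(kw):
            w = kernel[i, j]
            if w == 0:
                continue  # zero weight: no work at all
            window = feature[i:i + oh, j:j + ow]
            if w == 1:
                out += window  # +1 weight: accumulate
            else:
                out -= window  # -1 weight: subtract
    return out


if __name__ == "__main__":
    feature = np.arange(16).reshape(4, 4)
    kernel = np.array([[1, 0, -1],
                       [0, 1, 0],
                       [-1, 0, 1]])
    print(ternary_conv2d(feature, kernel))
```

A binary-weight layer is the special case in which no weight is zero, so every kernel position contributes one addition or one subtraction.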
