Abstract:Due to less memory requirement, low computation overhead and negligible accuracy degradation, deep neural networks with binary/ternary weights (BTNNs) have been widely employed on low-power mobile and Internet of Things (IoT) devices with limited storage capacity. Some hardware implementations have been proposed to accelerate the inference of BTNNs by utilizing the multiplication-free feature. However, some implicit characteristics in BTNN convolution, such as high arithmetic complexity and numerous redundant operations, are never considered. In this paper, we propose four optimization techniques to fully exploit these features. First, a feature-integral-based convolution (FIBC) method is proposed to reduce the arithmetic complexity of convolutional layers. Second, a kernel-transformation-feature-reconstruction (KTFR) convolution method is presented to remove redundant operations in BTNN convolution. Third, a hierarchical load-balancing mechanism (HLBM) is designed to eliminate zero value computation and improve resource utilization. Finally, a joint optimization approach for convolutional layers is proposed to search optimal calculation pattern for each layer. Based on the proposed four techniques, we design a reconfigurable processor in a 28-nm CMOS technology to accelerate the inferences of BTNNs. The four proposed techniques improve energy efficiency by 2.07 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> , 1.65 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> , 1.25 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> , and 2.24 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> for BTNNs respectively, compared with the baseline implementation which disables the proposed techniques. Benchmarked with binary-weight AlexNet, the processor achieves an energy efficiency of 19.9 TOPS/W at 200 MHz and 0.9 V.

A 3.89-Gops/mw Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

Accelerating Recurrent Neural Networks: A Memory-Efficient Approach

An Energy-Efficient Convolutional Neural Network Processor Architecture Based on a Systolic Array

A 510 μW 0.738-mm 2 6.2-pJ/SOP Online Learning Multi-Topology SNN Processor With Unified Computation Engine in 40-nm CMOS

A 0.99-to-4.38 Uj/class Event-Driven Hybrid Neural Network Processor for Full-Spectrum Neural Signal Analyses

OCEAN: an On-Chip Incremental-Learning Enhanced Processor with Gated Recurrent Neural Network Accelerators.

A 6.54-to-26.03 TOPS/W Computing-In-Memory RNN Processor Using Input Similarity Optimization and Attention-based Context-breaking with Output Speculation

RRAM-DNN: an RRAM and Model-Compression Empowered All-Weights-On-Chip DNN Accelerator

An Energy-Efficient Reconfigurable Processor for Binary-and Ternary-Weight Neural Networks With Flexible Data Bit Width

A 65nm 4kb Algorithm-Dependent Computing-in-Memory SRAM Unit-Macro with 2.3ns and 55.8TOPS/W Fully Parallel Product-Sum Operation for Binary DNN Edge Processors

RISC-V Toolchain and Agile Development based Open-source Neuromorphic Processor

Design of High Performance RNN Accelerator Based on Network Compression

A 41.7TOPS/W@INT8 Computing-in-Memory Processor with Zig-Zag Backbone-Systolic CIM and Block/Self-Gating CAM for NN/Recommendation Applications

1.63 Pj/sop Neuromorphic Processor with Integrated Partial Sum Routers for In-Network Computing

Energy-efficient SNN Architecture using 3nm FinFET Multiport SRAM-based CIM with Online Learning

Chipmunk: A Systolically Scalable 0.9 mm${}^2$, 3.08 Gop/s/mW @ 1.2 mW Accelerator for Near-Sensor Recurrent Neural Network Inference

Multicore Spiking Neuromorphic Chip in 180-nm with ReRAM Synapses and Digital Neurons

An Energy-Efficient Computing-in-Memory NN Processor with Set-Associate Blockwise Sparsity and Ping-Pong Weight Update

PL-NPU: an Energy-Efficient Edge-Device DNN Training Processor with Posit-Based Logarithm-Domain Computing

7.5 A 65nm 0.39-to-140.3tops/w 1-to-12b Unified Neural Network Processor Using Block-Circulant-Enabled Transpose-Domain Acceleration with 8.1 × Higher TOPS/mm2and 6T HBST-TRAM-Based 2D Data-Reuse Architecture

RNNAccel: A Fusion Recurrent Neural Network Accelerator for Edge Intelligence