Abstract:Due to less memory requirement, low computation overhead and negligible accuracy degradation, deep neural networks with binary/ternary weights (BTNNs) have been widely employed on low-power mobile and Internet of Things (IoT) devices with limited storage capacity. Some hardware implementations have been proposed to accelerate the inference of BTNNs by utilizing the multiplication-free feature. However, some implicit characteristics in BTNN convolution, such as high arithmetic complexity and numerous redundant operations, are never considered. In this paper, we propose four optimization techniques to fully exploit these features. First, a feature-integral-based convolution (FIBC) method is proposed to reduce the arithmetic complexity of convolutional layers. Second, a kernel-transformation-feature-reconstruction (KTFR) convolution method is presented to remove redundant operations in BTNN convolution. Third, a hierarchical load-balancing mechanism (HLBM) is designed to eliminate zero value computation and improve resource utilization. Finally, a joint optimization approach for convolutional layers is proposed to search optimal calculation pattern for each layer. Based on the proposed four techniques, we design a reconfigurable processor in a 28-nm CMOS technology to accelerate the inferences of BTNNs. The four proposed techniques improve energy efficiency by 2.07 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> , 1.65 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> , 1.25 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> , and 2.24 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> for BTNNs respectively, compared with the baseline implementation which disables the proposed techniques. Benchmarked with binary-weight AlexNet, the processor achieves an energy efficiency of 19.9 TOPS/W at 200 MHz and 0.9 V.

A Low-Cost Reconfigurable Nonlinear Core for Embedded DNN Applications

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU.

DaDianNao: A Machine-Learning Supercomputer

Reconfigurable Multifunction Computing Unit Using an Universal Piecewise Linear Method

NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions

A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network

Accelerating Low Bit-Width Convolutional Neural Networks with Embedded FPGA.

XNOR Neural Engine: A Hardware Accelerator IP for 21.6-fJ/op Binary Neural Network Inference

Resource-constrained FPGA/DNN co-design

A 16 nJ/Classification FPGA-Based Wired-Logic DNN Accelerator Using Fixed-Weight Non-Linear Neural Net

A DNN Optimization Framework with Unlabeled Data for Efficient and Accurate Reconfigurable Hardware Inference

Efficient FPGA Implementation of softmax Layer in Deep Neural Network

An Energy-Efficient Deep Belief Network Processor Based on Heterogeneous Multi-Core Architecture With Transposable Memory and On-Chip Learning

ONE-SA: Enabling Nonlinear Operations in Systolic Arrays for Efficient and Flexible Neural Network Inference

An Energy-Efficient Reconfigurable Processor for Binary-and Ternary-Weight Neural Networks With Flexible Data Bit Width

LiCo-Net: Linearized Convolution Network for Hardware-efficient Keyword Spotting

A Cost-Efficient High-Speed VLSI Architecture for Spiking Convolutional Neural Network Inference Using Time-Step Binary Spike Maps

A Precision-Scalable RISC-V DNN Processor with On-Device Learning Capability at the Extreme Edge

A Configurable Nonlinear Operation Unit for Neural Network Accelerator

LUTNet: Learning FPGA Configurations for Highly Efficient Neural Network Inference