Abstract: Owing to their small memory footprint, low computation overhead, and negligible accuracy degradation, deep neural networks with binary/ternary weights (BTNNs) are widely deployed on low-power mobile and Internet of Things (IoT) devices with limited storage capacity. Several hardware implementations have been proposed to accelerate BTNN inference by exploiting its multiplication-free nature. However, some implicit characteristics of BTNN convolution, such as high arithmetic complexity and numerous redundant operations, have not been considered. In this paper, we propose four optimization techniques to fully exploit these characteristics. First, a feature-integral-based convolution (FIBC) method is proposed to reduce the arithmetic complexity of convolutional layers. Second, a kernel-transformation-feature-reconstruction (KTFR) convolution method is presented to remove redundant operations in BTNN convolution. Third, a hierarchical load-balancing mechanism (HLBM) is designed to eliminate zero-value computation and improve resource utilization. Finally, a joint optimization approach for convolutional layers is proposed to search for the optimal calculation pattern for each layer. Based on the four proposed techniques, we design a reconfigurable processor in 28-nm CMOS technology to accelerate BTNN inference.
The four proposed techniques improve energy efficiency by 2.07×, 1.65×, 1.25×, and 2.24× for BTNNs, respectively, compared with the baseline implementation, which disables the proposed techniques. Benchmarked with binary-weight AlexNet, the processor achieves an energy efficiency of 19.9 TOPS/W at 200 MHz and 0.9 V.
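
To make the multiplication-free property concrete, the following minimal sketch shows a convolution with ternary weights computed using only additions and subtractions, skipping zero weights. It is an illustration under stated assumptions, not the FIBC, KTFR, or HLBM methods of the paper: the function names `ternary_dot` and `ternary_conv2d_valid` are hypothetical, the input is a single-channel feature map, and no padding or stride is modeled.

```python
def ternary_dot(features, weights):
    """Dot product with weights restricted to {-1, 0, +1}.

    Uses only add/subtract; no multiplier is needed.
    """
    acc = 0
    for x, w in zip(features, weights):
        if w == 0:
            continue  # zero weight contributes nothing, so skip it
        acc = acc + x if w > 0 else acc - x
    return acc


def ternary_conv2d_valid(feature_map, kernel):
    """Single-channel 'valid' convolution (no kernel flip, as in DNNs).

    feature_map and kernel are lists of lists; kernel entries are in {-1, 0, +1}.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    flat_kernel = [w for row in kernel for w in row]
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Gather the window in the same row-major order as flat_kernel.
            window = [feature_map[i + di][j + dj]
                      for di in range(kh) for dj in range(kw)]
            row.append(ternary_dot(window, flat_kernel))
        out.append(row)
    return out


fmap = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
kernel = [[1, 0],
          [-1, 1]]
result = ternary_conv2d_valid(fmap, kernel)
print(result)  # [[2, 3], [5, 6]]
```

A binary-weight layer is the special case in which no weight is zero, so every input is either added or subtracted.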

An Energy-Efficient Reconfigurable Processor for Binary- and Ternary-Weight Neural Networks With Flexible Data Bit Width
