Abstract:Due to less memory requirement, low computation overhead and negligible accuracy degradation, deep neural networks with binary/ternary weights (BTNNs) have been widely employed on low-power mobile and Internet of Things (IoT) devices with limited storage capacity. Some hardware implementations have been proposed to accelerate the inference of BTNNs by utilizing the multiplication-free feature. However, some implicit characteristics in BTNN convolution, such as high arithmetic complexity and numerous redundant operations, are never considered. In this paper, we propose four optimization techniques to fully exploit these features. First, a feature-integral-based convolution (FIBC) method is proposed to reduce the arithmetic complexity of convolutional layers. Second, a kernel-transformation-feature-reconstruction (KTFR) convolution method is presented to remove redundant operations in BTNN convolution. Third, a hierarchical load-balancing mechanism (HLBM) is designed to eliminate zero value computation and improve resource utilization. Finally, a joint optimization approach for convolutional layers is proposed to search optimal calculation pattern for each layer. Based on the proposed four techniques, we design a reconfigurable processor in a 28-nm CMOS technology to accelerate the inferences of BTNNs. The four proposed techniques improve energy efficiency by 2.07 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> , 1.65 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> , 1.25 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> , and 2.24 <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\times $ </tex-math></inline-formula> for BTNNs respectively, compared with the baseline implementation which disables the proposed techniques. Benchmarked with binary-weight AlexNet, the processor achieves an energy efficiency of 19.9 TOPS/W at 200 MHz and 0.9 V.

A 22nm 54.94TFLOPS/W Transformer Fine-Tuning Processor with Exponent-Stationary Re-Computing, Aggressive Linear Fitting, and Logarithmic Domain Multiplicating

A 28nm 77.35TOPS/W Similar Vectors Traceable Transformer Processor with Principal-Component-Prior Speculating and Dynamic Bit-wise Stationary Computing.

An Energy-Efficient Transformer Processor Exploiting Dynamic Weak Relevances in Global Attention

A 28nm 4.35tops/mm2 Transformer Accelerator with Basis-vector Based Ultra Storage Compression, Decomposed Computation and Unified LUT-Assisted Cores

FET-OPU: A Flexible and Efficient FPGA-Based Overlay Processor for Transformer Networks

A 52.01 TFLOPS/W Diffusion Model Processor with Inter-Time-Step Convolution-Attention-Redundancy Elimination and Bipolar Floating-Point Multiplication

A 28nm 49.7TOPS/W Sparse Transformer Processor with Random-Projection-Based Speculation, Multi-Stationary Dataflow, and Redundant Partial Product Elimination

LTrans-OPU: A Low-Latency FPGA-Based Overlay Processor for Transformer Networks

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

A 28nm 276.55TFLOPS/W Sparse Deep-Neural-Network Training Processor with Implicit Redundancy Speculation and Batch Normalization Reformulation

A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing.

BETA: Binarized Energy-Efficient Transformer Accelerator at the Edge

An Efficient Two-Stage Pipelined Compute-in-Memory Macro for Accelerating Transformer Feed-Forward Networks

H3D-Transformer: A Heterogeneous 3D (H3D) Computing Platform for Transformer Model Acceleration on Edge Devices

A 3.89-Gops/mw Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

An Energy-Efficient Reconfigurable Processor for Binary-and Ternary-Weight Neural Networks With Flexible Data Bit Width

A 28nm 1.07TFLOPS/mm<sup>2</sup> Dynamic-Precision Training Processor with Online Dynamic Execution and Multi- Level-Aligned Block-FP Processing

FACT: FFN-Attention Co-optimized Transformer Architecture with Eager Correlation Prediction.

Improving Transformer Inference Through Optimized Non-Linear Operations with Quantization-Approximation-Based Strategy

TransFRU: Efficient Deployment of Transformers on FPGA with Full Resource Utilization

A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications