7.5 A 65nm 0.39-to-140.3TOPS/W 1-to-12b Unified Neural Network Processor Using Block-Circulant-Enabled Transpose-Domain Acceleration with 8.1× Higher TOPS/mm² and 6T HBST-TRAM-Based 2D Data-Reuse Architecture

Jinshan Yue, Ruoyang Liu, Wenyu Sun, Zhe Yuan, Zhibo Wang, Yung-Ning Tu, Yi-Ju Chen, Ao Ren, Yanzhi Wang, Meng-Fan Chang, Xueqing Li, Huazhong Yang, Yongpan Liu
DOI: https://doi.org/10.1109/isscc.2019.8662360
2019-01-01
Abstract: Energy-efficient neural-network (NN) processors have been proposed for battery-powered deep-learning applications, where convolutional (CNN), fully-connected (FC) and recurrent NNs (RNN) are three major workloads. To support all of them, previous solutions [1–3] use either area-inefficient heterogeneous architectures, including CNN and RNN cores, or an energy-inefficient reconfigurable architecture. A block-circulant algorithm [4] can unify CNN/FC/RNN workloads with transpose-domain acceleration, as shown in Fig. 7.5.1. Once NN weights are trained using the block-circulant pattern, all workloads are transformed into consistent matrix-vector multiplications (MVM), which can potentially achieve 8-to-128$\times$ storage savings and an $O(n^{2})$-to-$O(n\log n)$ computation complexity reduction.
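The complexity reduction claimed in the abstract can be illustrated with a minimal NumPy sketch. It assumes the FFT-based formulation commonly used for circulant matrices; the abstract does not state which transform the chip implements, and the function names, the `(p, q, b)` weight layout, and the block size `b` are all hypothetical choices made for this example. Each `b`-by-`b` circulant block is stored as its first column only, so storage falls by a factor of `b` (which would match the quoted 8-to-128$\times$ range if block sizes of 8 to 128 are used, an assumption here), and each block product becomes an element-wise multiply in the frequency domain.

```python
import numpy as np

def circulant(w):
    """Dense b x b circulant matrix whose first column is w (reference only)."""
    b = len(w)
    return np.array([[w[(i - j) % b] for j in range(b)] for i in range(b)])

def block_circulant_mvm(W, x):
    """Multiply a block-circulant matrix by a vector using FFTs.

    W: array of shape (p, q, b); W[i, j] is the first column of block (i, j).
    x: input vector of length q * b.
    Returns the output vector of length p * b.
    """
    p, q, b = W.shape
    X = np.fft.fft(x.reshape(q, b), axis=1)   # transform each input sub-vector
    Wf = np.fft.fft(W, axis=2)                # transform each block's first column
    # Element-wise multiply per block, accumulate over the q input blocks.
    Y = np.einsum('pqb,qb->pb', Wf, X)
    return np.fft.ifft(Y, axis=1).real.reshape(p * b)

def dense_reference(W, x):
    """Same product computed with the expanded dense matrix, for checking."""
    p, q, b = W.shape
    dense = np.block([[circulant(W[i, j]) for j in range(q)] for i in range(p)])
    return dense @ x
```

For example, with `W` of shape `(2, 3, 8)` and `x` of length 24, `block_circulant_mvm(W, x)` agrees with `dense_reference(W, x)` to floating-point precision while storing 48 weights instead of 384.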