SemiMap: A Semi-Folded Convolution Mapping for Speed-Overhead Balance on Crossbars.

Lei Deng, Ling Liang, Guanrui Wang, Liang Chang, Xing Hu, Xin Ma, Liu, Jing Pei, Guoqi Li, Yuan Xie
DOI: https://doi.org/10.1109/tcad.2018.2883959
IF: 2.9
2020-01-01
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract: Crossbar architectures are widely used in neural network (NN) accelerators, involving both conventional and emerging devices. They perform well on fully connected layers thanks to efficient vector–matrix multiplication. However, this advantage degrades on convolutional layers, which involve heavy data reuse, because execution speed and resource overhead are imbalanced under the existing fully unfolded or fully folded mapping strategies. To address this issue, we propose a novel semi-folded mapping (SemiMap) framework for implementing convolution on crossbars. It simultaneously folds the physical resources along the row dimension of feature maps (FMs) and unfolds them along the column dimension. The former reduces the resource overhead, and the latter maintains the parallelism. An FM slicing scheme is further proposed to enable the processing of large images. With our mapping framework, a row-by-row streaming pipeline for intraimage dataflow and a periodic pipeline for interimage dataflow are easily obtained. To validate the idea, we build a many-crossbar architecture with several designs that guarantee the overall functionality and performance. Based on measurement data from a fabricated chip, a mapping compiler and a cycle-accurate simulator are developed for the hardware simulation of large-scale networks. We evaluate the proposed SemiMap on various convolutional NNs across different network scales. A resource saving of $>35\times$ and a cycle reduction of several hundred times are demonstrated compared with the existing fully unfolded and fully folded strategies, respectively. This paper moves beyond the current extreme mapping schemes and provides a balanced solution for efficiently deploying computational graphs with data reuse on many-crossbar architectures.
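The row-folded, column-unfolded idea described in the abstract can be illustrated with a small sketch. This is a toy model under stated assumptions, not the authors' implementation: it assumes a single-channel FM, stride 1, and no padding, and all function names (`build_row_crossbar`, `vmm`, `semi_folded_conv2d`, `direct_conv2d`) are hypothetical. Each kernel row is unfolded along the column dimension into one matrix, so a single vector-matrix multiply produces a whole row of partial sums, and the same matrices are reused as the input rows stream in one at a time.

```python
def direct_conv2d(fm, kernel):
    """Reference valid convolution (cross-correlation), stride 1, no padding."""
    H, W, K = len(fm), len(fm[0]), len(kernel)
    return [[sum(fm[i + r][j + c] * kernel[r][c]
                 for r in range(K) for c in range(K))
             for j in range(W - K + 1)]
            for i in range(H - K + 1)]


def build_row_crossbar(kernel_row, W):
    """Unfold one kernel row along the column dimension (assumed layout).

    Returns a W x (W-K+1) matrix whose column j holds the kernel row
    shifted down by j positions.
    """
    K = len(kernel_row)
    Wo = W - K + 1
    xbar = [[0] * Wo for _ in range(W)]
    for j in range(Wo):
        for c in range(K):
            xbar[j + c][j] = kernel_row[c]
    return xbar


def vmm(vec, mat):
    """Vector-matrix multiply, the primitive a crossbar computes."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]


def semi_folded_conv2d(fm, kernel):
    """Stream input rows one per step, reusing the same K row matrices."""
    H, W, K = len(fm), len(fm[0]), len(kernel)
    Ho, Wo = H - K + 1, W - K + 1
    xbars = [build_row_crossbar(kernel[r], W) for r in range(K)]
    out = [[0] * Wo for _ in range(Ho)]
    for i, row in enumerate(fm):      # folded: one time step per input row
        for r in range(K):            # kernel row r contributes to output row i - r
            o = i - r
            if 0 <= o < Ho:
                partial = vmm(row, xbars[r])
                out[o] = [a + b for a, b in zip(out[o], partial)]
    return out


# Small usage example: the streamed result matches the direct convolution.
fm = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
assert semi_folded_conv2d(fm, kernel) == direct_conv2d(fm, kernel)
```

In this toy model the number of row matrices depends only on the kernel height, while the number of streaming steps equals the number of input rows, which is one way to picture the balance between resource overhead and parallelism.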