7.2 A 12nm Programmable Convolution-Efficient Neural-Processing-Unit Chip Achieving 825TOPS

Yang Jiao, Liang Han, Rong Jin, Yi-Jung Su, Chiente Ho, Li Yin, Yun Li, Long Chen, Zhen Chen, Lu Liu, Zhuyu He, Yu Yan, Jun He, Jun Mao, Xiaotao Zai, Xuejun Wu, Yongquan Zhou, Mingqiu Gu, Guocai Zhu, Rong Zhong, Wenyuan Lee, Ping Chen, Yiping Chen, Weiliang Li, Deyu Xiao, Qing Yan, Mingyuan Zhuang, Jiejun Chen, Yun Tian, Yingzi Lin, Wei Wu, Hao Li, Zesheng Dou
DOI: https://doi.org/10.1109/isscc19947.2020.9062984
2020-01-01
Abstract: Convolutional neural networks (CNNs) are a key application in data centers, which calls for accelerators that are: 1) efficient for CNN computations; 2) high-throughput, so as to be cost-efficient; and 3) flexible enough in programming to accommodate algorithm upgrades. Because no such chip was available on the market, we designed our own. Matrix multiplication (MM) and convolution (CONV) are the two most computation-intensive deep learning (DL) operations. Most existing accelerators, such as GPUs [6], [7], the TPU [9], and a few new AI chips [3], [4], are architected for GEMM. Computing CONV on a GEMM engine requires the img2col() transformation to flatten images into general matrices. This introduces huge data inflation, which not only leads to unnecessary extra computation and storage but also decreases arithmetic intensity and pushes performance toward being I/O- and memory-bound. Although some accelerators, such as [5], exploit the CONV architecture directly, integrating larger yet balanced computing power into a single chip is quite challenging. Moreover, given the fast evolution of DL algorithms, it is critical for data center scenarios to design a programmable neural processing unit (NPU) rather than a dedicated ASIC. To satisfy the above requirements, our NPU is architected to be CONV-efficient under the control of operation-fused coarse-grained instructions. It integrates as much computing power as possible via squeezed computation with a large SRAM-only design. It also delivers programming flexibility via an instruction set architecture (ISA) that covers anticipated forward-looking functionality.
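The data inflation attributed to img2col() above can be illustrated with a minimal sketch. This is not the authors' code: the function names `im2col` and `inflation_factor`, the single-channel 6x6 input, the 3x3 kernel, and the stride-1, no-padding setting are all assumptions chosen for illustration.

```python
def im2col(image, kh, kw, stride=1):
    """Flatten every kh x kw window of a 2-D image (list of lists) into one row.

    Illustrative helper (hypothetical name); no padding is applied.
    """
    h, w = len(image), len(image[0])
    rows = []
    for i in range(0, h - kh + 1, stride):
        for j in range(0, w - kw + 1, stride):
            # Copy the window starting at (i, j) in row-major order.
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def inflation_factor(h, w, kh, kw, stride=1):
    """Ratio of elements in the flattened matrix to elements in the input."""
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    return (out_h * out_w * kh * kw) / (h * w)


# Assumed toy input: a 6x6 single-channel image with values 0..35.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
cols = im2col(image, 3, 3)

# 16 windows of 9 elements each: 144 stored values for a 36-element input.
print(len(cols), len(cols[0]), inflation_factor(6, 6, 3, 3))
```

Under these assumptions the flattened matrix holds four times as many elements as the original image, because overlapping windows duplicate most pixels. The same arithmetic per output is now spread over more stored and moved data, which is the drop in arithmetic intensity the abstract describes.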