Yang Jiao, Liang Han, Rong Jin, Yi-Jung Su, Chiente Ho, Li Yin, Yun Li, Long Chen, Zhen Chen, Lu Liu, Zhuyu He, Yu Yan, Jun He, Jun Mao, Xiaotao Zai, Xuejun Wu, Yongquan Zhou, Mingqiu Gu, Guocai Zhu, Rong Zhong, Wenyuan Lee, Ping Chen, Yiping Chen, Weiliang Li, Deyu Xiao, Qing Yan, Mingyuan Zhuang, Jiejun Chen, Yun Tian, Yingzi Lin, Wei Wu, Hao Li, Zesheng Dou

Abstract: Convolutional neural networks (CNNs) represent a key application in data centers, which calls for accelerators that are: 1) efficient for CNN computations; 2) high-throughput, so as to be cost-efficient; and 3) flexible enough in programming to accommodate algorithm upgrades. Because no such chip was available on the market, we designed our own. Matrix multiplication (MM) and convolution (CONV) are the top two deep learning (DL) operations requiring intensive computation. Most existing accelerators, such as GPUs [6], [7], the TPU [9], and a few new AI chips [3], [4], are architected for GEMM. Computing CONV on a GEMM engine requires the img2col() transformation to flatten images into general matrices. This introduces huge data inflation, which not only leads to unnecessary extra computation and storage but also decreases arithmetic intensity and leaves performance bound by I/O and memory. Although some accelerators, such as [5], exploit the CONV architecture directly, integrating larger but balanced computing power into a single chip is quite challenging. Moreover, given the fast evolution of DL algorithms, it is critical in data center scenarios to design a programmable neural processing unit (NPU) instead of a dedicated ASIC. To satisfy the above requirements, our NPU is architected to be CONV-efficient under the control of operation-fused coarse-grained instructions. It integrates as much computing power as possible via squeezed computation with a large SRAM-only design. It also delivers programming flexibility via an instruction set architecture (ISA) that covers anticipated forward-looking functionality.
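The data inflation that the abstract attributes to img2col() can be illustrated with a small sketch. The code below is not taken from the paper: the function names (`conv_output_size`, `im2col`, `im2col_inflation`) and the example layer shape are assumptions chosen for illustration, and the transformation is the one commonly called im2col, applied here to a single-channel image without padding.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Number of output positions along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def im2col(image, kernel, stride=1):
    """Flatten every kernel x kernel patch of a 2D list into one matrix row.

    Hypothetical helper: single channel, no padding.
    """
    height, width = len(image), len(image[0])
    out_h = conv_output_size(height, kernel, stride)
    out_w = conv_output_size(width, kernel, stride)
    rows = []
    for i in range(out_h):
        for j in range(out_w):
            # Overlapping patches copy the same input pixel into several rows.
            rows.append([
                image[i * stride + di][j * stride + dj]
                for di in range(kernel)
                for dj in range(kernel)
            ])
    return rows


def im2col_inflation(height, width, channels, kernel, stride=1, padding=0):
    """Return (original element count, flattened element count, ratio)."""
    out_h = conv_output_size(height, kernel, stride, padding)
    out_w = conv_output_size(width, kernel, stride, padding)
    original = height * width * channels
    flattened = out_h * out_w * kernel * kernel * channels
    return original, flattened, flattened / original


if __name__ == "__main__":
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    matrix = im2col(image, kernel=3)
    print(len(matrix), "rows of", len(matrix[0]), "elements")
    # Assumed example layer: 56x56 feature map, 64 channels, 3x3 kernel, padding 1.
    print(im2col_inflation(56, 56, 64, kernel=3, padding=1))
```

For a 3x3 kernel with stride 1 and "same" padding, each input element is copied into up to nine matrix rows, so the flattened operand is about nine times the size of the original feature map. That extra storage and memory traffic is the overhead a CONV-native datapath aims to avoid.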

A 28-nm Energy-Efficient Sparse Neural Network Processor for Point Cloud Applications Using Block-Wise Online Neighbor Searching

A 28nm 2D/3D Unified Sparse Convolution Accelerator with Block-Wise Neighbor Searcher for Large-Scaled Voxel-Based Point Cloud Network.

A Demonstration Platform for Large-Scaled Point Cloud Network Based on 28nm 2D/3D Unified Sparse Convolution Accelerator.

DaDianNao: A Machine-Learning Supercomputer

Voxel-CIM: An Efficient Compute-in-Memory Accelerator for Voxel-based Point Cloud Neural Networks

A 3.89-GOPS/mW Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

An Efficient FPGA Accelerator for Point Cloud

An Efficient Accelerator for Point-based and Voxel-based Point Cloud Neural Networks

A Sparse-Adaptive CNN Processor with Area/Performance balanced N-Way Set-Associate PE Arrays Assisted by a Collision-Aware Scheduler

An Energy-Efficient Computing-in-Memory NN Processor with Set-Associate Blockwise Sparsity and Ping-Pong Weight Update

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

Accelerating DNN-based 3D point cloud processing for mobile computing

7.2 A 12nm Programmable Convolution-Efficient Neural-Processing-Unit Chip Achieving 825TOPS

An Energy-Efficient Convolutional Neural Network Processor Architecture Based on a Systolic Array

14.3 A 65nm Computing-in-Memory-Based CNN Processor with 2.9-to-35.8 TOPS/W System Energy Efficiency Using Dynamic-Sparsity Performance-Scaling Architecture and Energy-Efficient Inter/Intra-Macro Data Reuse

Reconfigurable Spatial-Parallel Stochastic Computing for Accelerating Sparse Convolutional Neural Networks

A 28nm Configurable Asynchronous SNN Accelerator with Energy-Efficient Learning

A NoC-Based Spatial DNN Inference Accelerator with Memory-Friendly Dataflow

An Efficient Accelerator for Sparse Convolutional Neural Networks

A Systolic Computing-in-Memory Array Based Accelerator with Predictive Early Activation for Spatiotemporal Convolutions