Abstract: Convolutional neural networks (CNNs) are widely used in modern artificial intelligence (AI) systems. In particular, GoogLeNet, one of the most popular CNNs, consists of a number of inception layers and max-pooling layers and has been intensively studied for mobile and embedded scenarios. However, the energy efficiency of GoogLeNet in hardware is still limited by the heavy data movement between the processor and the memory. Designing a dataflow and a corresponding hardware architecture that achieve parallel processing with minimal data movement is therefore critical to high energy efficiency and throughput. In this paper, we propose a novel column stationary (CS) dataflow that maximally exploits the local data reuse of both the filter weights and the feature maps. Moreover, we propose a reconfigurable spatial architecture that maps multiple convolution kernels (of different types and dimensions) in parallel onto the processing engine (PE) array, so that multiple convolution kernels can share the same input feature maps (activations) during computation. In our hardware design, we use three typical convolution kernels (i.e., 1×1, 3×3, and 5×5, corresponding to the inception layers of GoogLeNet) as an example to test the efficiency of the proposed dataflow and hardware architecture. The accelerator was implemented for one inception layer of GoogLeNet in a foundry's 55-nm CMOS process. The test results show that our CS dataflow reduces the energy consumed by memory access by about 85% and reduces the area and power of the computing logic by 13% and 12%, respectively. In summary, our CS dataflow is more energy-efficient than state-of-the-art dataflows.
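The idea that several convolution kernels can share the same input feature map can be illustrated with a small software model. The sketch below is not the CS dataflow or the hardware described in the abstract; it is a toy example under stated assumptions: kernels of size 1×1, 3×3, and 5×5 (the sizes used in GoogLeNet's inception modules), zero padding, and a simple counter of input reads standing in for memory traffic. All function names (`fetch_window`, `apply_kernel`, `run_separate`, `run_shared`) are hypothetical.

```python
def fetch_window(x, i, j, r):
    """Return the zero-padded (2r+1)x(2r+1) window centred on (i, j)
    and the number of in-bounds input values that were read."""
    h, w = len(x), len(x[0])
    win, reads = [], 0
    for di in range(-r, r + 1):
        row = []
        for dj in range(-r, r + 1):
            ii, jj = i + di, j + dj
            if 0 <= ii < h and 0 <= jj < w:
                row.append(x[ii][jj])
                reads += 1
            else:
                row.append(0)  # zero padding costs no input read in this model
        win.append(row)
    return win, reads


def apply_kernel(win, k):
    """Multiply-accumulate kernel k against the centre of a (possibly larger) window."""
    off = (len(win) - len(k)) // 2
    n = len(k)
    return sum(k[a][b] * win[a + off][b + off] for a in range(n) for b in range(n))


def run_separate(x, kernels):
    """Each kernel fetches its own input window (no sharing)."""
    h, w = len(x), len(x[0])
    outs, reads = [], 0
    for k in kernels:
        out = [[0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                win, n = fetch_window(x, i, j, len(k) // 2)
                reads += n
                out[i][j] = apply_kernel(win, k)
        outs.append(out)
    return outs, reads


def run_shared(x, kernels):
    """Fetch the largest window once per pixel and reuse it for every kernel."""
    h, w = len(x), len(x[0])
    r = max(len(k) for k in kernels) // 2
    outs = [[[0] * w for _ in range(h)] for _ in kernels]
    reads = 0
    for i in range(h):
        for j in range(w):
            win, n = fetch_window(x, i, j, r)
            reads += n
            for idx, k in enumerate(kernels):
                outs[idx][i][j] = apply_kernel(win, k)
    return outs, reads


# Assumed toy input: a 6x6 feature map and all-ones kernels of size 1, 3, and 5.
x = [[i * 6 + j for j in range(6)] for i in range(6)]
kernels = [[[1] * s for _ in range(s)] for s in (1, 3, 5)]
sep_outs, sep_reads = run_separate(x, kernels)
shr_outs, shr_reads = run_shared(x, kernels)
print("separate reads:", sep_reads, "shared reads:", shr_reads)
```

Both functions produce the same outputs, but the shared version reads the input only as often as the largest kernel alone would. The read counter is only a proxy for data movement and says nothing about the energy, area, or power figures reported in the abstract.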

Pipeline Design of Nonvolatile-based Computing in Memory for Convolutional Neural Networks Inference Accelerators

DaDianNao: A Machine-Learning Supercomputer

A Convolution Neural Network Accelerator Design with Weight Mapping and Pipeline Optimization

Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering

Floating Gate Transistor‐Based Accurate Digital In‐Memory Computing for Deep Neural Networks

On Designing Efficient and Reliable Nonvolatile Memory-Based Computing-In-Memory Accelerators

A Reconfigurable Computing-in-Memory Accelerator with Dynamic Group-Based Dataflow and Dual-Input Macro Designs

Pipeline Gradient-based Model Training on Analog In-memory Accelerators

NAND-SPIN-based processing-in-MRAM architecture for convolutional neural network acceleration

Computing Utilization Enhancement for Chiplet-based Homogeneous Processing-in-Memory Deep Learning Processors

A Reconfigurable Spatial Architecture for Energy-Efficient Inception Neural Networks

Research on Convolutional Neural Network Inference Acceleration and Performance Optimization for Edge Intelligence

A High-Performance Pixel-Level Fully Pipelined Hardware Accelerator for Neural Networks

Enhancing ConvNets With ConvFIFO: A Crossbar PIM Architecture Based on Kernel-Stationary First-In-First-Out Dataflow

Lattice: an ADC/DAC-less ReRAM-based Processing-In-Memory Architecture for Accelerating Deep Convolution Neural Networks

Design of Computing-in-Memory (CIM) with Vertical Split-Gate Flash Memory for Deep Neural Network (DNN) Inference Accelerator

A NoC-Based Spatial DNN Inference Accelerator with Memory-Friendly Dataflow

Simulation of a Fully Digital Computing-in-Memory for Non-Volatile Memory for Artificial Intelligence Edge Applications

Efficient Discrete Temporal Coding Spike-Driven In-Memory Computing Macro for Deep Neural Network Based on Nonvolatile Memory

A Heterogeneous Microprocessor for Intermittent AI Inference Using Nonvolatile-SRAM-based Compute-In-Memory

A Convolutional Spiking Neural Network Accelerator with the Sparsity-Aware Memory and Compressed Weights