Abstract: Convolutional neural networks (CNNs) are widely used in modern artificial intelligence (AI) systems. In particular, GoogLeNet, one of the most popular CNNs, which consists of a number of inception layers and max-pooling layers, has been intensively studied for mobile and embedded scenarios. However, the energy efficiency of GoogLeNet in hardware is still limited by the large amount of data movement between the processor and the memory. Designing a dataflow and a corresponding hardware architecture that achieve parallel processing with minimal data movement is therefore critical to high energy efficiency and throughput. In this paper, we propose a novel column stationary (CS) dataflow that maximally exploits the local data reuse of both the filter weights and the feature maps. Moreover, we propose a reconfigurable spatial architecture that maps multiple convolution kernels (of different types and dimensions) in parallel onto the processing engine (PE) array, so that multiple convolution kernels can share the same input feature maps (activations) during computation. In our hardware design, we use three typical convolution kernels (i.e., 1×1, 3×3, and 5×5, corresponding to the inception layers of GoogLeNet) as an example to test the efficiency of the proposed dataflow and hardware architecture. The accelerator was implemented for one inception layer of GoogLeNet in a foundry's 55-nm CMOS process. The test results show that our CS dataflow reduces the energy consumed by memory access by ~85% and saves 13% of the area and 12% of the power for computing. In summary, our CS dataflow is more energy-efficient than state-of-the-art dataflows.
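The benefit of letting several convolution kernels share one set of input activations can be illustrated with a small software model. The sketch below is not the CS dataflow or the PE array described in the abstract; it is a minimal illustration under stated assumptions: a single input channel, stride 1, zero padding, and a simple count of input reads used as a rough proxy for memory traffic. The function names `conv_separate` and `conv_shared`, the 6×6 input, and the kernel values are all hypothetical.

```python
def conv_separate(fmap, kernel):
    """Direct 2-D convolution (zero padding, stride 1) for ONE kernel.

    Returns the output map and the number of input activations read.
    """
    h, w = len(fmap), len(fmap[0])
    r = len(kernel) // 2  # kernel radius (assumes odd, square kernels)
    out = [[0] * w for _ in range(h)]
    reads = 0
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += fmap[yy][xx] * kernel[dy + r][dx + r]
                        reads += 1  # one input fetch per in-bounds tap
            out[y][x] = acc
    return out, reads


def conv_shared(fmap, kernels):
    """Same convolutions, but each input window is fetched once and
    reused by every kernel (hypothetical model of activation sharing)."""
    h, w = len(fmap), len(fmap[0])
    rmax = max(len(k) for k in kernels) // 2
    outs = [[[0] * w for _ in range(h)] for _ in kernels]
    reads = 0
    for y in range(h):
        for x in range(w):
            # Fetch the largest window once into a local buffer.
            window = {}
            for dy in range(-rmax, rmax + 1):
                for dx in range(-rmax, rmax + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        window[(dy, dx)] = fmap[yy][xx]
                        reads += 1
            # Every kernel reuses the buffered window; no new input reads.
            for i, kernel in enumerate(kernels):
                r = len(kernel) // 2
                acc = 0
                for dy in range(-r, r + 1):
                    for dx in range(-r, r + 1):
                        acc += window.get((dy, dx), 0) * kernel[dy + r][dx + r]
                outs[i][y][x] = acc
    return outs, reads


# Illustrative data: a 6x6 input and 1x1, 3x3, 5x5 kernels (values arbitrary).
fmap = [[(3 * y + x) % 7 for x in range(6)] for y in range(6)]
kernels = [
    [[2]],
    [[1, 0, -1], [2, 0, -2], [1, 0, -1]],
    [[1] * 5 for _ in range(5)],
]

separate = [conv_separate(fmap, k) for k in kernels]
separate_outs = [out for out, _ in separate]
separate_reads = sum(n for _, n in separate)
shared_outs, shared_reads = conv_shared(fmap, kernels)

print("reads without sharing:", separate_reads)
print("reads with sharing:   ", shared_reads)
```

Both paths produce identical outputs; only the number of input fetches differs, because in the shared version the 1×1 and 3×3 kernels reuse the window already fetched for the 5×5 kernel. Real energy savings depend on the memory hierarchy and the PE array organization, which this model does not capture.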

A 65-Nm Energy-Efficient Interframe Data Reuse Neural Network Accelerator for Video Applications

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

An Energy-Efficient Differential Frame Convolutional Accelerator with On-Chip Fusion Storage Architecture and Pixel-Level Pipeline Data Flow

14.2 A 65nm 24.7 µJ/Frame 12.3 mW Activation-Similarity-Aware Convolutional Neural Network Video Processor Using Hybrid Precision, Inter-Frame Data Reuse and Mixed-Bit-Width Difference-Frame Data Codec

A Computationally Efficient Neural Video Compression Accelerator Based on a Sparse CNN-Transformer Hybrid Network

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

A Reconfigurable Accelerator for Sparse Convolutional Neural Networks

An Asynchronous Energy-Efficient CNN Accelerator with Reconfigurable Architecture

Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks

Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering

A High-performance Inference Accelerator Exploiting Patterned Sparsity in CNNs

An Efficient Sparse CNNs Accelerator on FPGA

An Efficient CNN Accelerator for Pattern-Compressed Sparse Neural Networks on FPGA

A Reconfigurable Spatial Architecture for Energy-Efficient Inception Neural Networks

An Energy-Efficient Spiking Neural Network Accelerator Based on Spatio-Temporal Redundancy Reduction

Relative Indexed Compressed Sparse Filter Encoding Format for Hardware-Oriented Acceleration of Deep Convolutional Neural Networks

An Efficient CNN Inference Accelerator Based on Intra- and Inter-Channel Feature Map Compression

VWA: Hardware Efficient Vectorwise Accelerator for Convolutional Neural Network

A High Efficient Architecture for Convolution Neural Network Accelerator