Abstract: Convolution and matrix operations are both important computations in Deep Neural Networks (DNNs). However, the significant differences between their computation patterns, together with the low‐power requirement of edge devices, make it challenging to support both convolution (Conv) and general matrix multiplication (GEMM) efficiently in hardware. This paper proposes a Conv‐GEMM reconfigurable accelerator architecture for high‐throughput edge processing. It introduces a weight stationary‐row streaming (WS‐RS) dataflow scheme that maximizes data reuse through hierarchical memory structures and flexible PE connections, and supports high‐throughput edge‐based deep learning algorithms. Based on this dataflow, a multi‐scale memory access network (MMAN), a reconfigurable accumulator array (RAA), and a configurable instruction set architecture (ISA) are designed to optimize computation throughput and energy efficiency. Implemented in 65 nm technology, the accelerator achieves a peak performance of 1.15 TOPS at 250 MHz with an energy efficiency of 1.14 TOPS/W. GEMM computation achieves an 85.7% latency improvement, and MobileNet‐V1 processing achieves a throughput of 529 fps at a 256 × 224 image size with 87.15% top‐5 accuracy on the ImageNet dataset.
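
The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is a back-of-envelope calculation, not part of the paper: it assumes the common convention that one multiply-accumulate (MAC) counts as two operations, and that the energy-efficiency figure applies at the peak operating point. Neither assumption is stated in the abstract, and the function names are illustrative only.

```python
# Back-of-envelope check of the abstract's headline numbers.
# ASSUMPTIONS (not stated in the abstract):
#   - 1 MAC = 2 operations (multiply + add)
#   - the TOPS/W figure is quoted at the peak-performance operating point

PEAK_TOPS = 1.15          # peak performance, from the abstract
CLOCK_MHZ = 250           # clock frequency, from the abstract
TOPS_PER_WATT = 1.14      # energy efficiency, from the abstract
MOBILENET_FPS = 529       # MobileNet-V1 throughput, from the abstract
OPS_PER_MAC = 2           # assumed convention


def ops_per_cycle(peak_tops: float, clock_mhz: float) -> float:
    """Operations the array must complete per clock cycle at peak."""
    return (peak_tops * 1e12) / (clock_mhz * 1e6)


def implied_power_watts(peak_tops: float, tops_per_watt: float) -> float:
    """Power implied by dividing peak throughput by energy efficiency."""
    return peak_tops / tops_per_watt


def avg_time_per_frame_ms(fps: float) -> float:
    """Average time budget per frame at the given throughput."""
    return 1000.0 / fps


if __name__ == "__main__":
    opc = ops_per_cycle(PEAK_TOPS, CLOCK_MHZ)
    print(f"ops per cycle:        {opc:.0f}")
    print(f"MACs per cycle:       {opc / OPS_PER_MAC:.0f}  (assuming 2 ops/MAC)")
    print(f"implied power:        {implied_power_watts(PEAK_TOPS, TOPS_PER_WATT):.2f} W")
    print(f"avg time per frame:   {avg_time_per_frame_ms(MOBILENET_FPS):.2f} ms")
```

Under these assumptions, the peak figure corresponds to about 4600 operations (roughly 2300 MACs) per cycle and an implied power of about 1.01 W. The per-frame time is an average derived from throughput, not a measured single-frame latency.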

OCEAN: an On-Chip Incremental-Learning Enhanced Processor with Gated Recurrent Neural Network Accelerators

OCEAN: an On-Chip Incremental-Learning Enhanced Artificial Neural Network Processor with Multiple Gated-Recurrent-Unit Accelerators

DaDianNao: A Machine-Learning Supercomputer

A 3.89-GOPS/mW Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU

A Precision-Scalable RISC-V DNN Processor with On-Device Learning Capability at the Extreme Edge

An Energy-Efficient Deep Belief Network Processor Based on Heterogeneous Multi-Core Architecture With Transposable Memory and On-Chip Learning

Detection of Serum Amyloid A Isoforms in Cattle

An Efficient Neuromorphic Implementation of Temporal Coding-Based On-Chip STDP Learning

ReckOn: A 28nm Sub-mm2 Task-Agnostic Spiking Recurrent Neural Network Processor Enabling On-Chip Learning over Second-Long Timescales

Computing Utilization Enhancement for Chiplet-based Homogeneous Processing-in-Memory Deep Learning Processors

ANP-I: A 28-nm 1.5-pJ/SOP Asynchronous Spiking Neural Network Processor Enabling Sub-0.1-μJ/Sample On-Chip Learning for Edge-AI Applications

A 28nm Configurable Asynchronous SNN Accelerator with Energy-Efficient Learning

PL-NPU: an Energy-Efficient Edge-Device DNN Training Processor with Posit-Based Logarithm-Domain Computing

RRAM-based Analog-Weight Spiking Neural Network Accelerator with In-Situ Learning for IoT Applications

Live Demonstration: An Efficient Neural Network Processor with Reduced Data Transmission and On-chip Shortcut Mapping

Audio and Image Cross-Modal Intelligence Via a 10TOPS/W 22nm SoC with Back-Propagation and Dynamic Power Gating

A Conv‐GEMM reconfigurable accelerator with WS‐RS dataflow for high throughput processing