Abstract: The constant growth of DNNs makes them challenging to implement and run efficiently on traditional compute-centric architectures. Some accelerators have attempted to solve the memory wall problem by adding more compute units and on-chip buffers, without much success and sometimes even worsening the issue, since more compute units also require higher memory bandwidth. Prior works have proposed memory-centric architectures based on the Near-Data Processing (NDP) paradigm. NDP seeks to break the memory wall by moving computation closer to the memory hierarchy, reducing data movement and its cost as much as possible. 3D-stacked memory is especially appealing for DNN accelerators due to its high-density, low-energy storage and its near-memory computation capabilities, which allow DNN operations to be performed massively in parallel. However, memory accesses remain the main bottleneck for running modern DNNs efficiently. To improve the efficiency of DNN inference, we present QeiHaN, a hardware accelerator that implements a 3D-stacked memory-centric weight storage scheme to take advantage of a logarithmic quantization of activations. In particular, since activations of the FC and CONV layers of modern DNNs are commonly represented as powers of two with negative exponents, QeiHaN performs an implicit in-memory bit-shifting of the DNN weights to reduce memory activity: only the meaningful bits of the weights required for the bit-shift operation are accessed. Overall, QeiHaN reduces memory accesses by 25\% compared to a standard memory organization. We evaluate QeiHaN on a popular set of DNNs. On average, QeiHaN provides a $4.3\times$ speedup and $3.5\times$ energy savings over a Neurocube-like accelerator.
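
The idea of replacing a multiplication with a bit-shift when an activation is a power of two with a negative exponent can be illustrated with a minimal Python sketch. It is not the QeiHaN hardware design: the function names, the round-to-nearest exponent rule, the exponent clamp, and the 8-bit integer weight are all assumptions made for illustration only.

```python
import math


def log2_quantize(activation: float, max_exp: int = 7) -> int:
    """Return an exponent e >= 0 such that 2**-e approximates the activation.

    Assumption: activations lie in (0, 1] and are rounded to the nearest
    power of two in the log domain; zero or negative activations are
    rejected here and would need separate handling.
    """
    if activation <= 0:
        raise ValueError("this sketch only handles positive activations")
    e = round(-math.log2(activation))
    return min(max(e, 0), max_exp)  # clamp to the representable range


def shift_multiply(weight: int, e: int) -> int:
    """Compute weight * 2**-e with an arithmetic right shift."""
    return weight >> e


def bits_needed(weight_bits: int, e: int) -> int:
    """Bits of the weight that survive the shift.

    The e least-significant bits are shifted out, so only the upper
    (weight_bits - e) bits have to be read from memory.
    """
    return max(weight_bits - e, 0)


e = log2_quantize(0.25)          # 0.25 == 2**-2, so e == 2
product = shift_multiply(96, e)  # 96 * 0.25 computed as 96 >> 2
needed = bits_needed(8, e)       # 6 of the 8 weight bits are meaningful
```

In this sketch, a larger exponent means fewer weight bits contribute to the result, which is the intuition behind accessing only the meaningful bits of each weight.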
