Abstract: Event-driven spiking neural networks (SNNs) have demonstrated significant potential for high energy and area efficiency. However, existing SNN accelerators suffer from high latency and energy consumption caused by serial accumulation-comparison operations: an SNN neuron integrates input spikes, accumulates membrane potential, and generates an output spike when the potential exceeds a threshold. One way to address this is to leverage the sparsity of SNN spikes to reduce the number of time steps, but this can result in imbalanced workloads among neurons and limit the utilization of processing elements (PEs). In this paper, we present SATO, a temporal-parallel SNN accelerator that accumulates membrane potential for all time steps in parallel. SATO adopts a two-stage pipeline that decouples neuron computation into distinct stages, which maintains accuracy while exposing fine-grained parallelism: spike accumulation for each time step executes concurrently, reducing latency and improving the overall efficiency of the accelerator. The architecture includes a novel binary adder-search tree that generates the output spike train, decoupling the chronological dependence in the accumulation-comparison operation. Furthermore, SATO employs a bucket-sort-based method to distribute compressed workloads evenly across all PEs while maximizing the data locality of input spike trains. Experimental results on various SNN models show that SATO outperforms the 8-bit version of the well-known accelerator "Eyeriss" by 20.7× in speedup and 6.0× in energy saving, on average. Compared to the state-of-the-art SNN accelerator "SpinalFlow", SATO achieves, on average, a 4.6× performance gain and a 3.1× energy reduction for inference.
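
The decoupling described in the abstract can be illustrated with a minimal software sketch. It is not the SATO hardware design. It assumes a non-leaky integrate-and-fire neuron with integer weights and a reset-to-zero rule after each spike, none of which the abstract specifies, and all function names are hypothetical. Stage 1 computes one partial sum per time step, and these sums are mutually independent, so they could be computed in parallel. Stage 2 is a plain sequential scan that stands in for the binary adder-search tree, whose internals the abstract does not describe.

```python
from typing import List, Sequence


def serial_if_neuron(weights: Sequence[int],
                     spikes: Sequence[Sequence[int]],
                     threshold: int) -> List[int]:
    """Baseline: accumulate and compare inside one chronological loop."""
    potential = 0
    out = []
    for step in spikes:                    # one row of 0/1 inputs per time step
        for w, s in zip(weights, step):
            if s:
                potential += w             # integrate every arriving spike
        if potential > threshold:          # fire when the threshold is exceeded
            out.append(1)
            potential = 0                  # assumed reset-to-zero rule
        else:
            out.append(0)
    return out


def accumulate_per_step(weights: Sequence[int],
                        spikes: Sequence[Sequence[int]]) -> List[int]:
    """Stage 1: per-time-step weighted sums; no step depends on another."""
    return [sum(w for w, s in zip(weights, step) if s) for step in spikes]


def generate_spikes(step_sums: Sequence[int], threshold: int) -> List[int]:
    """Stage 2: derive the spike train from the per-step sums.

    A sequential scan is used here purely for illustration.
    """
    potential = 0
    out = []
    for x in step_sums:
        potential += x
        if potential > threshold:
            out.append(1)
            potential = 0                  # same assumed reset rule as above
        else:
            out.append(0)
    return out


def two_stage_if_neuron(weights: Sequence[int],
                        spikes: Sequence[Sequence[int]],
                        threshold: int) -> List[int]:
    """Two-stage version: accumulate for all steps first, then compare."""
    return generate_spikes(accumulate_per_step(weights, spikes), threshold)
```

Under these assumptions the two-stage version returns the same spike train as the serial baseline for any integer inputs, which mirrors the abstract's claim that decoupling the stages maintains accuracy.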
