Abstract: Recurrent neural networks (RNNs) have achieved state-of-the-art performance on various sequence learning tasks due to their powerful sequence modeling capability. However, RNNs usually require a large number of parameters and incur high computational complexity. Hence, it is quite challenging to implement complex RNNs on embedded devices with stringent memory and latency requirements. In this paper, we first present a novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), to tackle these implementation challenges. By properly using circulant matrices, forward nonlinear function approximation, and efficient quantization schemes with a retrain-based training strategy, the proposed compression method can reduce memory usage by more than 95% with negligible accuracy loss when verified on language modeling and speech recognition tasks. An efficient, scalable parallel hardware architecture is then proposed for the compressed LSTM. With an innovative chessboard division method for matrix-vector multiplications, the parallelism of the proposed hardware architecture can be freely chosen under a given latency requirement. Specifically, for the circulant matrix-vector multiplications employed in the compressed LSTM, the circulant matrices are judiciously reorganized to fit the chessboard division and minimize the number of memory accesses required for the matrix multiplications. The proposed architecture is modeled using register transfer language (RTL) and synthesized under the TSMC 90-nm CMOS technology. With 518.5 kB of on-chip memory, we are able to process a 512×512 compressed LSTM in 1.71 μs, corresponding to 2.46 TOPS on the uncompressed one, at a cost of 30.77 mm² of chip area. The implementation results demonstrate that the proposed design achieves high flexibility and area efficiency, which satisfies the needs of many real-time applications on embedded devices.
It is worth mentioning that the memory-efficient approach of accelerating LSTM developed in this paper is also applicable to other RNN variants.
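
As a rough illustration of why circulant matrices save memory, the sketch below stores each b×b block of a weight matrix as a single length-b defining vector and multiplies directly from those vectors. It is a minimal sketch under stated assumptions, not the paper's method: the first-column indexing convention, the function names, and the toy values are all hypothetical, and quantization, nonlinear function approximation, and the chessboard division are not modeled.

```python
# Block-circulant matrix-vector product computed from the defining vectors only.
# Assumed convention (not taken from the paper): block entry (i, j) = c[(i - j) mod b],
# i.e. c is the first column of the circulant block.

def circulant_dense(c):
    """Expand a defining vector into the full b x b circulant block."""
    b = len(c)
    return [[c[(i - j) % b] for j in range(b)] for i in range(b)]

def block_circulant_matvec(blocks, x, b):
    """Multiply by a block-circulant matrix without expanding it.

    blocks[p][q] is the length-b defining vector of block (p, q).
    """
    rows, cols = len(blocks), len(blocks[0])
    y = [0.0] * (rows * b)
    for p in range(rows):
        for q in range(cols):
            c = blocks[p][q]
            xs = x[q * b:(q + 1) * b]
            for i in range(b):
                acc = 0.0
                for j in range(b):
                    acc += c[(i - j) % b] * xs[j]
                y[p * b + i] += acc
    return y

def expand_dense(blocks, b):
    """Build the equivalent dense matrix, used here only as a reference."""
    dense = []
    for block_row in blocks:
        mats = [circulant_dense(c) for c in block_row]
        for i in range(b):
            dense.append([v for m in mats for v in m[i]])
    return dense

def dense_matvec(m, x):
    """Plain dense matrix-vector product."""
    return [sum(a * xv for a, xv in zip(row, x)) for row in m]

def stored_params(blocks, b):
    """Number of weights kept in memory for the compressed form."""
    return len(blocks) * len(blocks[0]) * b

# Toy 4x4 matrix made of 2x2 circulant blocks (values are arbitrary).
blocks = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]]]
x = [1, 0, 0, 1]
y_compressed = block_circulant_matvec(blocks, x, 2)
y_reference = dense_matvec(expand_dense(blocks, 2), x)
```

In this toy case the compressed form keeps 8 weights instead of 16; in general the storage ratio for the weight matrices is 1/b, so larger blocks compress more at the cost of a more constrained weight structure.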

Chipmunk: A Systolically Scalable 0.9 mm², 3.08 Gop/s/mW @ 1.2 mW Accelerator for Near-Sensor Recurrent Neural Network Inference

A 3.89-Gops/mW Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

Vau da muntanialas: Energy-efficient multi-die scalable acceleration of RNN inference

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

RNNAccel: A Fusion Recurrent Neural Network Accelerator for Edge Intelligence

MASR: A Modular Accelerator for Sparse RNNs

A 5.1pJ/Neuron 127.3us/Inference RNN-based Speech Recognition Processor using 16 Computing-in-Memory SRAM Macros in 65nm CMOS

Accelerator-Aware Training for Transducer-Based Speech Recognition

More is Less: Domain-Specific Speech Recognition Microprocessor Using One-Dimensional Convolutional Recurrent Neural Network

An Ultra-Low Power Binarized Convolutional Neural Network-Based Speech Recognition Processor with On-Chip Self-Learning

14.1 A 510nW 0.41V Low-Memory Low-Computation Keyword-Spotting Chip Using Serial FFT-Based MFCC and Binarized Depthwise Separable Convolutional Neural Network in 28nm CMOS

CHIMERA: A 0.92-TOPS, 2.2-TOPS/W Edge AI Accelerator With 2-MByte On-Chip Foundry Resistive RAM for Efficient Training and Inference

ReckOn: A 28nm Sub-mm2 Task-Agnostic Spiking Recurrent Neural Network Processor Enabling On-Chip Learning over Second-Long Timescales

OCEAN: an On-Chip Incremental-Learning Enhanced Processor with Gated Recurrent Neural Network Accelerators

Accelerating Recurrent Neural Networks: A Memory-Efficient Approach

An analog-AI chip for energy-efficient speech recognition and transcription

Spiking neural networks trained with backpropagation for low power neuromorphic implementation of voice activity detection

Low-power Neuromorphic Speech Recognition Engine with Coarse-Grain Sparsity

Efficient Binary Weight Convolutional Network Accelerator for Speech Recognition

A 1D-CRNN Inspired Reconfigurable Processor for Noise-robust Low-power Keywords Recognition

MorphBungee: A 65-nm 7.2-mm2 27-μJ/image Digital Edge Neuromorphic Chip with On-Chip 802-frame/s Multi-Layer Spiking Neural Network Learning