Abstract: Recurrent Neural Networks (RNNs) are becoming increasingly important for time-series applications that require efficient, real-time implementations. The two major types are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Real-time, efficient, and accurate hardware implementations of RNNs are challenging because RNNs are highly sensitive to accumulated imprecision and require special implementations of the activation functions. A key limitation of prior work is the lack of a systematic design optimization framework covering both the RNN model and its hardware implementation, especially when the block size (or compression ratio) must be optimized jointly with the RNN type, layer size, etc. In this paper, we adopt the block-circulant matrix-based framework and present the Efficient RNN (E-RNN) framework for FPGA implementations of Automatic Speech Recognition (ASR). The overall goal is to improve performance/energy efficiency subject to an accuracy requirement. We use the alternating direction method of multipliers (ADMM) for more accurate block-circulant training, and present two design explorations that provide guidance on block size and reduce the number of RNN training trials. Based on these two observations, we decompose E-RNN into two phases: Phase I determines the RNN model, reducing computation and storage subject to the accuracy requirement; Phase II covers the hardware implementation of the given RNN model, including processing element design/optimization, quantization, activation implementation, etc. Experimental results on actual FPGA deployments show that E-RNN achieves a maximum energy efficiency improvement of 37.4$\times$ over ESE, and more than 2$\times$ over C-LSTM, at the same accuracy.
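To make the role of the block size concrete, the following is a minimal NumPy sketch of a block-circulant matrix-vector product. It is an illustration under stated assumptions, not the E-RNN implementation: the function names (`circulant`, `dense_from_blocks`, `block_circulant_matvec`) and the storage layout `w[i, j, :]` (one defining vector of length `b` per block) are hypothetical choices made here. The sketch relies on the standard fact that multiplying by a circulant matrix is a circular convolution, which can be evaluated with FFTs.

```python
import numpy as np

def circulant(first_col):
    """Dense circulant matrix whose first column is `first_col`."""
    n = len(first_col)
    # Column j is the first column cyclically shifted down by j positions.
    return np.stack([np.roll(first_col, j) for j in range(n)], axis=1)

def dense_from_blocks(w):
    """Expand block-defining vectors w[i, j, :] into the full dense matrix.

    Assumed layout (hypothetical): w has shape (p, q, b), so the dense
    matrix has shape (p*b, q*b) and each b-by-b block is circulant.
    """
    p, q, _ = w.shape
    return np.block([[circulant(w[i, j]) for j in range(q)] for i in range(p)])

def block_circulant_matvec(w, x):
    """Compute dense_from_blocks(w) @ x without forming the dense matrix."""
    p, q, b = w.shape
    xf = np.fft.rfft(x.reshape(q, b), axis=1)   # FFT of each input segment
    wf = np.fft.rfft(w, axis=2)                 # FFT of each defining vector
    # Elementwise products in the frequency domain, summed over column blocks.
    yf = np.einsum("pqk,qk->pk", wf, xf)
    return np.fft.irfft(yf, n=b, axis=1).reshape(p * b)
```

With this layout, a weight matrix of shape `(p*b, q*b)` is stored using `p*q*b` values instead of `p*q*b*b`, so the block size `b` acts directly as the compression ratio mentioned above. A larger `b` saves more storage and computation but constrains the weights more strongly, which is why it has to be chosen against the accuracy requirement.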

FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

A Power-Efficient Accelerator Based on FPGAs for LSTM Network

Implementation and Optimization of the Accelerator Based on FPGA Hardware for LSTM Network

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

A Spiking LSTM Accelerator for Automatic Speech Recognition Application Based on FPGA

FPGA Implementation of LSTM Based on Automatic Speech Recognition

FPGA Acceleration of Recurrent Neural Network Based Language Model

FiC-RNN: A Multi-FPGA Acceleration Framework for Deep Recurrent Neural Networks

A Compact and Configurable Long Short-Term Memory Neural Network Hardware Architecture

A low-latency LSTM accelerator using balanced sparsity based on FPGA

Accelerating RNNs on FPGA with HBM

Acceleration of LSTM with Structured Pruning Method on FPGA

FPGA-based Accelerator for Convolutional Neural Network

A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM

The implementation of a Deep Recurrent Neural Network Language Model on a Xilinx FPGA

A Comprehensive Evaluation of FPGA-Based Spatial Acceleration of LLMs

E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs

Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators Through Attention Fusion

A Convolutional Neural Network Accelerator Based on FPGA

A compression strategy to accelerate LSTM meta-learning on FPGA