Abstract: Recurrent neural networks (RNNs) have achieved state-of-the-art performance on various sequence learning tasks due to their powerful sequence modeling capability. However, RNNs usually require a large number of parameters and incur high computational complexity. Hence, it is quite challenging to implement complex RNNs on embedded devices with stringent memory and latency requirements. In this paper, we first present a novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), to tackle these implementation challenges. By properly using circulant matrices, forward nonlinear function approximation, and efficient quantization schemes with a retrain-based training strategy, the proposed compression method can reduce memory usage by more than 95% with negligible accuracy loss when verified on language modeling and speech recognition tasks. An efficient, scalable parallel hardware architecture is then proposed for the compressed LSTM. With an innovative chessboard division method for matrix-vector multiplications, the parallelism of the proposed hardware architecture can be freely chosen under a given latency requirement. Specifically, for the circulant matrix-vector multiplications employed in the compressed LSTM, the circulant matrices are judiciously reorganized to fit the chessboard division and to minimize the number of memory accesses required for the matrix multiplications. The proposed architecture is modeled using register transfer language (RTL) and synthesized under the TSMC 90-nm CMOS technology. With 518.5 kB of on-chip memory, we are able to process a 512×512 compressed LSTM in 1.71 μs, corresponding to 2.46 TOPS on the uncompressed one, at a cost of 30.77 mm² of chip area. The implementation results demonstrate that the proposed design achieves high flexibility and area efficiency, which satisfies the requirements of many real-time applications on embedded devices.
It is worth mentioning that the memory-efficient approach of accelerating LSTM developed in this paper is also applicable to other RNN variants.
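
To make the memory saving from circulant matrices concrete, the following is a minimal illustrative sketch, not the hardware method of the paper. It assumes a block-circulant weight layout in which each b×b block is defined by a single length-b vector, so a block stores b values instead of b². The function names, the FFT-based product, and the block sizes are assumptions chosen for illustration only; the chessboard division and memory reorganization described above are not modeled here.

```python
import numpy as np

def circulant_from_first_column(c):
    """Expand the defining vector c into the full n x n circulant matrix,
    where entry (i, j) equals c[(i - j) mod n]."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

def circulant_matvec(c, x):
    """Compute C @ x from the defining vector c alone.
    A circulant product is a circular convolution, evaluated here via FFT
    (an assumption of this sketch, not a claim about the paper's datapath)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(blocks, x):
    """Multiply a block-circulant matrix by x.
    blocks has shape (p, q, b): a p x q grid of circulant blocks,
    each defined by b stored values (hypothetical layout)."""
    p, q, b = blocks.shape
    xs = x.reshape(q, b)
    y = np.zeros((p, b))
    for i in range(p):
        for j in range(q):
            y[i] += circulant_matvec(blocks[i, j], xs[j])
    return y.reshape(p * b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, q, b = 2, 3, 4  # illustrative sizes only
    blocks = rng.standard_normal((p, q, b))
    x = rng.standard_normal(q * b)
    dense = np.block([[circulant_from_first_column(blocks[i, j])
                       for j in range(q)] for i in range(p)])
    print(np.allclose(dense @ x, block_circulant_matvec(blocks, x)))
    print("stored values:", blocks.size, "dense values:", dense.size)
```

In this layout the number of stored weights shrinks by a factor of b relative to the dense matrix. The overall reduction reported in the abstract also relies on quantization and nonlinear function approximation, which this sketch does not cover.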

Hardware-Oriented Compression of Long Short-Term Memory for Efficient Inference

E-LSTM: an Efficient Hardware Architecture for Long Short-Term Memory

Hardware-Guided Symbiotic Training for Compact, Accurate, yet Execution-Efficient LSTM

C-LSTM: Enabling Efficient LSTM Using Structured Compression Techniques on FPGAs

Accelerating Recurrent Neural Networks: A Memory-Efficient Approach

MCMC: Multi-Constrained Model Compression Via One-Stage Envelope Reinforcement Learning

A Compact and Configurable Long Short-Term Memory Neural Network Hardware Architecture

A Configurable FPGA Accelerator of Bi-LSTM Inference with Structured Sparsity

A compression strategy to accelerate LSTM meta-learning on FPGA

A High Energy-Efficiency FPGA-Based LSTM Accelerator Architecture Design by Structured Pruning and Normalized Linear Quantization

Efficient Weight Reuse for Large LSTMs

Acceleration of LSTM with Structured Pruning Method on FPGA

E-LSTM: Efficient Inference of Sparse LSTM on Embedded Heterogeneous System

Memory-Efficient Compression Based on Least-Squares Fitting in Convolutional Neural Network Accelerators

All-in-one Hardware-Oriented Model Compression for Efficient Multi-Hardware Deployment

A low-latency LSTM accelerator using balanced sparsity based on FPGA

Tight Compression: Compressing CNN Through Fine-Grained Pruning and Weight Permutation for Efficient Implementation

Structured Term Pruning for Computational Efficient Neural Networks Inference

Compressing LSTM Networks by Matrix Product Operators

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization