Abstract: Recurrent neural networks (RNNs) have achieved state-of-the-art performance on various sequence learning tasks owing to their powerful sequence modeling capability. However, RNNs usually require a large number of parameters and incur high computational complexity. Hence, it is quite challenging to implement complex RNNs on embedded devices with stringent memory and latency requirements. In this paper, we first present a novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), to tackle these implementation challenges. By properly using circulant matrices, forward nonlinear function approximation, and efficient quantization schemes with a retrain-based training strategy, the proposed compression method reduces memory usage by more than 95% with negligible accuracy loss when verified on language modeling and speech recognition tasks. An efficient, scalable parallel hardware architecture is then proposed for the compressed LSTM. With an innovative chessboard division method for matrix-vector multiplications, the parallelism of the proposed hardware architecture can be freely chosen under a given latency requirement. Specifically, for the circulant matrix-vector multiplications employed in the compressed LSTM, the circulant matrices are judiciously reorganized to fit the chessboard division and to minimize the number of memory accesses required for the matrix multiplications. The proposed architecture is modeled using register transfer language (RTL) and synthesized in TSMC 90-nm CMOS technology. With 518.5 kB of on-chip memory, we are able to process a 512×512 compressed LSTM in 1.71 μs, corresponding to 2.46 TOPS on the uncompressed one, at a cost of 30.77 mm² of chip area. The implementation results demonstrate that the proposed design achieves high flexibility and area efficiency, which satisfies the needs of many real-time applications on embedded devices.
It is worth mentioning that the memory-efficient approach to accelerating LSTM developed in this paper is also applicable to other RNN variants.
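
The memory saving from circulant matrices mentioned in the abstract can be illustrated with a minimal sketch. It assumes a weight matrix partitioned into square circulant blocks, each stored only as its first column, so a block of size b needs b values instead of b². The function names, the block size, and the use of an FFT for the product are assumptions made for illustration; the abstract does not state the block size or how the hardware evaluates the products, and this is not the paper's chessboard-division scheme.

```python
import numpy as np

def circulant_from_first_column(c):
    """Build the dense n x n circulant matrix whose first column is c."""
    n = len(c)
    # Entry (i, j) of a circulant matrix is c[(i - j) mod n].
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

def circulant_matvec(c, x):
    """Compute C @ x for the circulant C defined by first column c,
    without materializing C (circular convolution via FFT)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(blocks, x):
    """Multiply a block-circulant matrix by x.

    blocks has shape (p, q, b): the first columns of a p x q grid of
    b x b circulant blocks (hypothetical storage layout).
    x has length q * b; the result has length p * b.
    """
    p, q, b = blocks.shape
    xs = x.reshape(q, b)
    y = np.zeros((p, b))
    for i in range(p):
        for j in range(q):
            y[i] += circulant_matvec(blocks[i, j], xs[j])
    return y.reshape(p * b)

def stored_values(rows, cols, b):
    """Number of stored weights for a rows x cols matrix built from
    b x b circulant blocks (one first column per block)."""
    return (rows // b) * (cols // b) * b
```

Under these assumptions, `stored_values(512, 512, 16)` gives 16384 stored weights against 262144 for the dense 512×512 matrix. That figure covers the circulant structure only; the reduction reported in the abstract also relies on quantization.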

Hardware-Guided Symbiotic Training for Compact, Accurate, yet Execution-Efficient LSTM

MCMC: Multi-Constrained Model Compression Via One-Stage Envelope Reinforcement Learning

Hardware-Oriented Compression of Long Short-Term Memory for Efficient Inference

A Compact and Configurable Long Short-Term Memory Neural Network Hardware Architecture

E-LSTM: an Efficient Hardware Architecture for Long Short-Term Memory

Accelerating Recurrent Neural Networks: A Memory-Efficient Approach

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training

C-LSTM: Enabling Efficient LSTM Using Structured Compression Techniques on FPGAs

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM

Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

Compressed LSTM Using Balanced Sparsity

DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks

Efficient Network Construction Through Structural Plasticity

Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

A low-latency LSTM accelerator using balanced sparsity based on FPGA

Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference

Towards Energy-Efficient, Low-Latency and Accurate Spiking LSTMs

Aggressive Post-Training Compression on Extremely Large Language Models

FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference