Abstract: Recurrent neural networks (RNNs) have achieved state-of-the-art performance on various sequence learning tasks owing to their powerful sequence modeling capability. However, RNNs usually require a large number of parameters and incur high computational complexity. Hence, it is challenging to implement complex RNNs on embedded devices with stringent memory and latency requirements. In this paper, we first present a novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), to tackle these implementation challenges. By properly using circulant matrices, forward nonlinear function approximation, and efficient quantization schemes with a retrain-based training strategy, the proposed compression method reduces memory usage by more than 95% with negligible accuracy loss when verified on language modeling and speech recognition tasks. An efficient, scalable parallel hardware architecture is then proposed for the compressed LSTM. With an innovative chessboard division method for matrix-vector multiplications, the parallelism of the proposed hardware architecture can be freely chosen under a given latency requirement. Specifically, for the circulant matrix-vector multiplications employed in the compressed LSTM, the circulant matrices are judiciously reorganized to fit the chessboard division and to minimize the number of memory accesses required for the matrix multiplications. The proposed architecture is modeled using register transfer language (RTL) and synthesized in TSMC 90-nm CMOS technology. With 518.5 kB of on-chip memory, we are able to process a 512 × 512 compressed LSTM in 1.71 μs, corresponding to 2.46 TOPS on the uncompressed one, at a cost of 30.77 mm² of chip area. The implementation results demonstrate that the proposed design achieves high flexibility and area efficiency, which satisfies the needs of many real-time applications on embedded devices.
It is worth mentioning that the memory-efficient approach to accelerating LSTM developed in this paper is also applicable to other RNN variants.
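
To illustrate why circulant matrices save memory, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes one common convention, in which a b × b circulant block is defined by its first column `c` so that entry (i, j) equals `c[(i - j) % b]`. The function names, the block layout, and the block size are hypothetical, and the chessboard division and hardware reorganization described in the abstract are not reproduced here. Under this assumption, each block stores b values instead of b², so a weight matrix partitioned into such blocks needs b times less storage.

```python
def circulant_matvec(c, x):
    """Multiply the n x n circulant matrix defined by first column c with x.

    Assumed convention: entry (i, j) is c[(i - j) % n], so only the n values
    in c are stored rather than the full n * n matrix.
    """
    n = len(c)
    if len(x) != n:
        raise ValueError("c and x must have the same length")
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def dense_circulant(c):
    """Expand the defining vector into the full matrix (for checking only)."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]


def block_circulant_matvec(blocks, x, b):
    """Multiply a block-circulant matrix with x.

    blocks[p][q] is the length-b defining vector of the circulant block in
    block-row p and block-column q (hypothetical layout for illustration).
    """
    rows, cols = len(blocks), len(blocks[0])
    if len(x) != cols * b:
        raise ValueError("x must have length cols * b")
    y = [0.0] * (rows * b)
    for p in range(rows):
        for q in range(cols):
            # Each block only touches its own slice of the input vector.
            part = circulant_matvec(blocks[p][q], x[q * b:(q + 1) * b])
            for i in range(b):
                y[p * b + i] += part[i]
    return y
```

For example, `block_circulant_matvec([[[1, 2], [3, 4]]], [1, 0, 0, 1], 2)` multiplies a 2 × 4 matrix that is stored as only four numbers. The direct loop above costs O(b²) per block; an FFT-based product is a common alternative, but whether the paper uses it is not stated in the abstract.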

Structured Word Embedding For Low Memory Neural Network Language Model

Neural Network Language Model Compression with Product Quantization and Soft Binarization

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

Highly Efficient Neural Network Language Model Compression Using Soft Binarization Training

Compressing Neural Language Models by Sparse Word Representations

Binarized LSTM Language Model

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

Accelerating Neural Machine Translation with Partial Word Embedding Compression

Recurrent Neural Network Language Model With Structured Word Embeddings For Speech Recognition

Lightweight Adaptation of Neural Language Models via Subspace Embedding

Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers

From Fully Trained to Fully Random Embeddings: Improving Neural Machine Translation with Compact Word Embedding Tables

Accelerating Recurrent Neural Networks: A Memory-Efficient Approach

Structured Compression by Weight Encryption for Unstructured Pruning and Quantization

Low-Memory Neural Network Training: A Technical Report

A Novel Low-Bit Quantization Strategy for Compressing Deep Neural Networks

Fast OOV Words Incorporation Using Structured Word Embeddings for Neural Network Language Model

Smart-DNN+: A Memory-efficient Neural Networks Compression Framework for the Model Inference

Joint Goal for Word Embedding Compression Based on Word Frequency

LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation