Hardware-Oriented Compression of Long Short-Term Memory for Efficient Inference.

Zhisheng Wang, Jun Lin, Zhongfeng Wang
DOI: https://doi.org/10.1109/lsp.2018.2834872
2018-01-01
IEEE Signal Processing Letters
Abstract: Long short-term memory (LSTM) and its variants have been widely adopted for processing sequential data. However, their intrinsically large memory requirements and high computational complexity make them hard to deploy in embedded systems. This creates a need for model compression and dedicated hardware accelerators for LSTM. In this letter, efficient clipped gating and top-k pruning schemes are introduced to convert the dense matrix computations in LSTM into structured sparse-matrix-sparse-vector multiplications. Then, mixed quantization schemes are developed to eliminate most of the multiplications in LSTM. The proposed compression scheme is well suited to efficient hardware implementation. Experimental results show that the model size and the number of matrix operations can be reduced by 32x and 18.5x, respectively, at a cost of less than 1% accuracy loss on a word-level language modeling task.
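The abstract does not spell out the pruning rule, so the following is only a minimal sketch of the general idea behind top-k pruning, assuming that the k largest-magnitude entries of an activation vector are kept and the rest are treated as zero. The function names `top_k_prune` and `sparse_matvec` and the toy values are hypothetical and are not taken from the letter; the clipped gating and quantization steps are not shown.

```python
def top_k_prune(vec, k):
    """Keep the k largest-magnitude entries of vec.

    Returns (indices, values) of the surviving entries, with indices in
    ascending order. This is an illustrative rule, not the paper's exact scheme.
    """
    if k <= 0:
        return [], []
    # Rank positions by absolute value, largest first, and keep the first k.
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    keep.sort()
    return keep, [vec[i] for i in keep]


def sparse_matvec(matrix, indices, values):
    """Multiply a row-major matrix by a sparse vector given as (indices, values).

    Only the columns listed in indices are touched, so each row costs
    len(indices) multiplications instead of one per column.
    """
    return [sum(row[j] * v for j, v in zip(indices, values)) for row in matrix]


# Toy data (hypothetical): a 2x4 weight matrix and a 4-element activation vector.
W = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 0.0, 2.0]]
h = [0.1, -0.9, 0.05, 0.7]

idx, vals = top_k_prune(h, k=2)
y = sparse_matvec(W, idx, vals)
```

In this toy case, pruning to k=2 means the product uses 4 multiplications rather than the 8 a dense 2x4 product would need; the savings reported in the abstract come from the authors' full scheme, not from this sketch.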