Run-Time Efficient RNN Compression for Inference on Edge Devices

Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew Mattina
DOI: https://doi.org/10.48550/arXiv.1906.04886
IF: 5.414
2019-06-12
Machine Learning
Abstract: Recurrent neural networks can be large and compute-intensive, yet many applications that benefit from RNNs run on small devices with very limited compute and storage capabilities while still having run-time constraints. As a result, there is a need for compression techniques that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper explores a new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) that achieves this dual objective. This scheme divides the weight matrix into two parts: an unconstrained upper half and a lower half composed of rank-1 blocks. This results in output features where the upper sub-vector has "richer" features while the lower sub-vector has "constrained" features. HMD can compress RNNs by a factor of 2-4x while having a faster run-time than pruning (Zhu & Gupta, 2017) and retaining more model accuracy than matrix factorization (Grachev et al., 2017). We evaluate this technique on 5 benchmarks spanning 3 different applications, illustrating its generality in the domain of edge computing.
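The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the block layout, so the sketch assumes the lower part is split into `k` equal-width column blocks, each stored as an outer product of two vectors. The function names (`hmd_dense`, `hmd_matvec`, `hmd_param_count`) and the shapes chosen are hypothetical and exist only for this illustration.

```python
import numpy as np

def hmd_dense(A, U, V):
    """Materialize the full weight matrix (used here only for checking).

    A: (r, n) unconstrained upper part.
    U: (k, p) and V: (k, n // k) hold the factors of the k rank-1 blocks.
    The column-block layout is an assumption of this sketch.
    """
    # Block i of the lower part is the outer product of U[i] and V[i].
    lower = np.hstack([np.outer(U[i], V[i]) for i in range(U.shape[0])])
    return np.vstack([A, lower])

def hmd_matvec(A, U, V, x):
    """Compute W @ x without ever forming the lower part of W."""
    k, b = V.shape
    upper = A @ x                                  # dense upper sub-vector
    # One dot product per block: s[i] = V[i] . x[i*b:(i+1)*b]
    s = np.einsum("ij,ij->i", V, x.reshape(k, b))
    lower = U.T @ s                                # sum_i s[i] * U[i]
    return np.concatenate([upper, lower])

def hmd_param_count(A, U, V):
    """Number of stored parameters in the decomposed form."""
    return A.size + U.size + V.size

# Small usage example with arbitrary, illustrative sizes.
rng = np.random.default_rng(0)
n, r, p, k = 8, 4, 4, 2
A = rng.standard_normal((r, n))
U = rng.standard_normal((k, p))
V = rng.standard_normal((k, n // k))
x = rng.standard_normal(n)

W = hmd_dense(A, U, V)
assert np.allclose(W @ x, hmd_matvec(A, U, V, x))
```

Under these assumptions the lower sub-vector of the output is a combination of only `k` vectors, which is one way to read the abstract's "constrained" features, while the upper sub-vector comes from an ordinary dense product. The savings come from storing `U` and `V` instead of the full lower part; the toy sizes above are far too small to reflect the 2-4x compression reported in the abstract.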