Neural Network Language Model Compression with Product Quantization and Soft Binarization

Kai Yu, Rao Ma, Kaiyu Shi, Qi Liu
DOI: https://doi.org/10.1109/taslp.2020.3015659
2020-01-01
Abstract: The large memory consumption of neural network language models (NN LMs) prohibits their use in many resource-constrained scenarios. Hence, effective NN LM compression approaches that are independent of the NN structure are of great interest. However, previous approaches usually achieve a high compression ratio at the cost of an obvious performance loss. In this paper, two recently proposed quantization approaches, product quantization (PQ) and soft binarization, are combined to address this issue. PQ decomposes the word embedding matrices into a Cartesian product of low-dimensional subspaces and quantizes each subspace separately. Soft binarization uses a small number of float scalars, together with knowledge distillation, to recover the performance lost during binarization. Experiments show that the proposed approaches achieve high compression ratios, from 70 to over 100, while maintaining performance comparable to the uncompressed NN LM in terms of both perplexity (PPL) and word error rate.
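To make the product quantization step concrete, the following is a minimal sketch of generic PQ applied to an embedding matrix. It is not the authors' implementation: the function names (`kmeans`, `pq_compress`, `pq_reconstruct`, `compression_ratio`), the use of plain k-means to build the codebooks, and the parameters `num_groups` and `num_codes` are assumptions made for illustration. Soft binarization and knowledge distillation are not shown.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dist.argmin(axis=1)


def pq_compress(emb, num_groups, num_codes):
    """Split columns into num_groups subspaces and quantize each separately."""
    vocab, dim = emb.shape
    assert dim % num_groups == 0, "dim must be divisible by num_groups"
    assert num_codes <= 256, "codes are stored as uint8 in this sketch"
    sub = dim // num_groups
    codebooks = np.empty((num_groups, num_codes, sub), dtype=emb.dtype)
    codes = np.empty((vocab, num_groups), dtype=np.uint8)
    for g in range(num_groups):
        block = emb[:, g * sub:(g + 1) * sub]
        codebooks[g], codes[:, g] = kmeans(block, num_codes, seed=g)
    return codebooks, codes


def pq_reconstruct(codebooks, codes):
    """Rebuild approximate embeddings by concatenating the selected codewords."""
    parts = [codebooks[g][codes[:, g]] for g in range(codebooks.shape[0])]
    return np.concatenate(parts, axis=1)


def compression_ratio(emb, codebooks, codes):
    """Bytes of the dense matrix divided by bytes of codebooks plus codes."""
    return emb.nbytes / (codebooks.nbytes + codes.nbytes)


if __name__ == "__main__":
    # Synthetic embeddings; the sizes here are arbitrary, not from the paper.
    emb = np.random.default_rng(0).standard_normal((1000, 64)).astype(np.float32)
    codebooks, codes = pq_compress(emb, num_groups=8, num_codes=16)
    approx = pq_reconstruct(codebooks, codes)
    print("ratio:", compression_ratio(emb, codebooks, codes))
    print("mse:", float(((emb - approx) ** 2).mean()))
```

In this sketch each word stores only `num_groups` small integer codes instead of a full float vector, which is where the memory saving comes from. The ratio it prints depends entirely on the synthetic sizes chosen above and says nothing about the 70 to 100 range reported in the abstract.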