CLMAD: A Chinese Language Model Adaptation Dataset

Ye Bai, Jianhua Tao, Jiangyan Yi, Zhengqi Wen, Cunhang Fan
DOI: https://doi.org/10.1109/iscslp.2018.8706600
2018-01-01
Abstract: A language model (LM) is an important part of a speech recognition system. Language model adaptation techniques use a large amount of source-domain data and a limited amount of target-domain data to improve language model performance in the target domain. Even though text datasets are easy to obtain, there is no public Chinese text dataset for language model adaptation tasks. This paper presents a language model adaptation dataset that consists of news data from four different domains: sport, stock, fashion, and finance. The discrepancy between the domains is evaluated. Model-combination-based adaptation of n-gram models is evaluated on the dataset, as are three different fine-tuning adaptation methods for recurrent neural network language models (RNNLMs). Word error rate (WER) results on AIShell speech data with the language models trained on this dataset are also provided. The absolute WER reduction of lattice rescoring with the adapted RNNLM is 4.74%.
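As a rough illustration of what model combination can look like, the sketch below linearly interpolates a source-domain model with a target-domain model and picks the mixing weight by perplexity on a small development set. This is an assumption about the general technique, not the paper's actual setup: the abstract does not specify the combination method, and the unigram models, the English toy sentences, and the names `train_unigram`, `interpolate`, and `perplexity` are all hypothetical and chosen only to keep the example short.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_source, p_target, lam):
    """Linear interpolation: lam * target + (1 - lam) * source."""
    return {w: lam * p_target[w] + (1.0 - lam) * p_source[w] for w in p_source}


def perplexity(model, tokens):
    """Per-token perplexity of a token sequence under a unigram model."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


# Hypothetical toy data standing in for source-domain text, target-domain
# text, and a target-domain development set. Real Chinese text would need
# word segmentation first.
source = "the team won the match and the fans cheered".split()
target = "the stock price rose and the market rallied".split()
dev = "the market price rose".split()
vocab = sorted(set(source) | set(target) | set(dev))

p_src = train_unigram(source, vocab)
p_tgt = train_unigram(target, vocab)

# Grid search over the interpolation weight, keeping the lowest perplexity.
best_lam, best_ppl = min(
    (
        (step / 10, perplexity(interpolate(p_src, p_tgt, step / 10), dev))
        for step in range(11)
    ),
    key=lambda pair: pair[1],
)

print(f"best lambda = {best_lam:.1f}, dev perplexity = {best_ppl:.2f}")
```

Because the grid includes the weights 0 and 1, the selected mixture is never worse on the development set than either model alone. Practical n-gram toolkits typically interpolate at the level of full n-gram probabilities rather than unigrams, but the weighting idea is the same.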