Combining Text Clustering and Retrieval for Corpus Adaptation

Feng He, Xiaoqing Ding
DOI: https://doi.org/10.1117/12.703646
2007-01-01
Abstract: Application-relevant text data are valuable in many natural language applications. Using them yields significantly better performance in vocabulary selection and language modeling, both widely employed in automatic speech recognition, intelligent input methods, and similar systems. In some situations, however, relevant data are hard to collect, and the scarcity of application-relevant training text makes these natural language processing tasks difficult. In this paper, we propose an approach that uses only a small set of application-specific text and combines unsupervised text clustering with text retrieval to find relevant text in a large, unorganized corpus, thereby adapting the training corpus toward the application area of interest. To evaluate the relevance of the acquired text, and hence to validate the effectiveness of the corpus adaptation approach, we use the performance of an n-gram statistical language model trained on the retrieved text and tested on the application-specific text. The language models trained on the ranked text bundles yield well-separated perplexities on the application-specific text. Preliminary experiments on short message text and a large unorganized corpus demonstrate the performance of the proposed methods.
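The evaluation idea in the abstract, scoring each bundle of retrieved text by the perplexity its language model assigns to the application-specific text, can be illustrated with a minimal sketch. This is not the authors' implementation: the bigram order, add-one smoothing, whitespace tokenization, the function names, and the toy bundles are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count bigrams and their left contexts over a token list."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams


def perplexity(model, test_tokens, vocab_size):
    """Perplexity of an add-one-smoothed bigram model on test_tokens."""
    if len(test_tokens) < 2:
        raise ValueError("need at least two test tokens")
    contexts, bigrams = model
    log_prob = 0.0
    pairs = list(zip(test_tokens, test_tokens[1:]))
    for prev, cur in pairs:
        # Add-one smoothing (an assumption, chosen only for simplicity).
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))


def rank_bundles(bundles, in_domain_text):
    """Rank text bundles by perplexity on in-domain text (lower is better)."""
    test = in_domain_text.split()
    vocab = set(test)
    for text in bundles.values():
        vocab.update(text.split())
    scores = {
        name: perplexity(train_bigram(text.split()), test, len(vocab))
        for name, text in bundles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical toy data: two candidate bundles and a short in-domain sample.
bundles = {
    "sms": "see you at lunch see you soon",
    "news": "the market fell sharply on monday",
}
ranked = rank_bundles(bundles, "see you soon at lunch")
```

In this sketch, a bundle whose model gives lower perplexity on the in-domain sample is treated as more relevant, which mirrors the use of perplexity as a relevance measure described above.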