Incorporate Web Search Technology to Solve Out-of-Vocabulary Words in Chinese Word Segmentation.

Wei Qiao,Maosong Sun
2009-01-01
Abstract:. Chinese word segmentation (CWS) is the fundamental technology for many NLP-related applications. It is reported that more than 60% of segmentation errors is caused by the out-of-vocabulary (OOV) words. Recent studies in CWS show that, statistical machine learning method is, to some extent, effective on solving OOV words. But labeled data is limited in size and unbalanced in content which makes it impossible to obtain all the required knowledge to recognize OOV words. In this paper, large scaled web data is incorporated as knowledge supplement. A framework which combines using web search technology and machine learning method is proposed. For each sentence, basic segmentation is performed using linear-chain Conditional Random Fields (CRF) model. Substrings which CRF model gives low confidence decisions are extracted and sent to search engine to perform web search based word segmentation. Final decision is made by considering both CRF model based segmentation result and that of web search based result. Evaluations are conducted on SIGHAN Bakeoff 2005 and 2006 datasets, showing the effectiveness of the proposed framework on dealing with OOV words.
What problem does this paper attempt to address?