Chinese Word Segmentation Probability Dictionary Training and Enrich Solution

Qi Wang, Guangping Zeng, Yonghao Wang, Gui Xunlong
2013-01-01
Abstract: Word segmentation is a necessary component of search engines for Asian languages, and the probability dictionary is the core component of word segmentation applications based on statistical language models. The traditional way to build a probability dictionary is manual annotation, which is slow and inefficient and usually fails to cover recently coined words. As society progresses, new words constantly enter the language. Incorporating these new words into the probability dictionary to improve the recall and precision of word segmentation is a major challenge for search engines in Asian languages such as Chinese. This article introduces an approach for automatically learning and enriching a probability dictionary. The solution, based on unsupervised machine learning, extracts word occurrence probabilities and word transition probabilities from user search logs and learns new words that do not exist in the current lexicon in order to enrich the tokenization probability dictionary.
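To make the two kinds of statistics in the abstract concrete, here is a minimal sketch of how word occurrence probabilities and word transition probabilities might be estimated from search logs. It is not the authors' implementation: it assumes the queries have already been split into candidate tokens, and the function names `train_probability_dictionary` and `find_new_words`, as well as the sample queries and lexicon, are hypothetical.

```python
from collections import Counter


def train_probability_dictionary(segmented_queries):
    """Estimate word and transition probabilities from tokenized queries.

    Assumption: each query is already a list of candidate word tokens.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in segmented_queries:
        unigram_counts.update(tokens)
        # Adjacent token pairs within a single query.
        bigram_counts.update(zip(tokens, tokens[1:]))

    total = sum(unigram_counts.values())
    # Occurrence probability: relative frequency of each word.
    word_prob = {w: c / total for w, c in unigram_counts.items()}

    # How often each word appears as the left element of a pair.
    left_counts = Counter()
    for (prev, _), c in bigram_counts.items():
        left_counts[prev] += c

    # Transition probability: P(next | prev) as a maximum-likelihood estimate.
    transition_prob = {
        (prev, nxt): c / left_counts[prev]
        for (prev, nxt), c in bigram_counts.items()
    }
    return word_prob, transition_prob


def find_new_words(word_prob, lexicon):
    """Return observed words that are missing from the existing lexicon."""
    return sorted(w for w in word_prob if w not in lexicon)


# Hypothetical sample data for illustration only.
queries = [["北京", "天气"], ["北京", "地铁"], ["微博", "登录"]]
lexicon = {"北京", "天气", "地铁", "登录"}

word_prob, transition_prob = train_probability_dictionary(queries)
new_words = find_new_words(word_prob, lexicon)
```

A real system would also need smoothing for unseen pairs and a frequency threshold before admitting a candidate into the dictionary; both are left out here to keep the sketch short.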