New word identification based on statistical classifier

Jianyi Liu, Jinghua Wang, Cong Wang
2006-01-01
Abstract: New word identification is a difficult problem in Chinese word segmentation. When large Chinese texts are segmented automatically, new words cause segmentation errors. This paper treats new word identification as a binary classification problem: deciding whether a character sequence in a given context is a new word. It applies two statistical learning approaches, one based on the support vector machine (SVM) and one on the C4.5 decision tree. We investigate various linguistic and statistical features, including the independent word probability of the former and latter characters, the front-position in-word probability of the former characters, the back-position in-word probability of the latter characters, mutual information, and frequency. In the PK-close test of the first Special Interest Group for Chinese Language Processing (SIGHAN) bakeoff, this approach achieves high precision and recall.
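To make the feature list concrete, here is a minimal sketch of how such features might be computed for a two-character candidate from a small segmented corpus. The abstract does not give the formulas, so the definitions below are assumptions: independent word probability is taken as the share of a character's occurrences that are single-character words, the in-word probabilities as the share of occurrences at the first or last position of a multi-character word, and mutual information as pointwise mutual information over adjacent characters. The function names and the toy corpus are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def char_position_counts(segmented_sentences):
    """Count how often each character occurs overall, as a single-character
    word, and at the front or back of a multi-character word."""
    total, single, front, back = Counter(), Counter(), Counter(), Counter()
    for sentence in segmented_sentences:
        for word in sentence:
            total.update(word)
            if len(word) == 1:
                single[word] += 1
            else:
                front[word[0]] += 1
                back[word[-1]] += 1
    return total, single, front, back


def pointwise_mutual_information(raw_sentences, x, y):
    """log2( p(xy) / (p(x) * p(y)) ) over adjacent characters.
    Assumed definition; returns -inf when the bigram never occurs."""
    char_counts, bigram_counts = Counter(), Counter()
    for s in raw_sentences:
        char_counts.update(s)
        bigram_counts.update(s[i:i + 2] for i in range(len(s) - 1))
    if bigram_counts[x + y] == 0:
        return float("-inf")
    p_xy = bigram_counts[x + y] / sum(bigram_counts.values())
    p_x = char_counts[x] / sum(char_counts.values())
    p_y = char_counts[y] / sum(char_counts.values())
    return math.log2(p_xy / (p_x * p_y))


def candidate_features(x, y, segmented_sentences):
    """Feature dictionary for the candidate sequence x + y (assumed definitions)."""
    total, single, front, back = char_position_counts(segmented_sentences)
    raw_sentences = ["".join(sentence) for sentence in segmented_sentences]

    def share(counter, ch):
        # Fraction of ch's occurrences that fall in the counted position.
        return counter[ch] / total[ch] if total[ch] else 0.0

    return {
        "iwp_x": share(single, x),
        "iwp_y": share(single, y),
        "front_in_word_prob_x": share(front, x),
        "back_in_word_prob_y": share(back, y),
        "mutual_information": pointwise_mutual_information(raw_sentences, x, y),
        "frequency": sum(s.count(x + y) for s in raw_sentences),
    }


# Hypothetical toy corpus, already segmented into words.
corpus = [["我", "喜欢", "北京"], ["北京", "很", "大"]]
features = candidate_features("北", "京", corpus)
```

Feature vectors of this kind, one per candidate sequence, could then be passed to any binary classifier, such as an SVM or a decision tree, along with a label marking whether the candidate is a new word.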