A Lexicon-Corpus-Based Unsupervised Chinese Word Segmentation Approach

Lu Pengyu, Pu Jingchuan, Du Mingming, Lou Xiaojuan, Jin Lijun
DOI: https://doi.org/10.21307/ijssis-2017-655
2014-01-01
International Journal on Smart Sensing and Intelligent Systems
Abstract: This paper presents a Lexicon-Corpus-based Unsupervised (LCU) Chinese word segmentation approach that improves Chinese word segmentation results. It combines the advantages of the lexicon-based and corpus-based approaches to identify out-of-vocabulary (OOV) words while keeping the segmentation of actual words in texts consistent. A Forward Maximum Fixed-count Segmentation (FMFS) algorithm is developed to first identify phrases in texts. Detailed rules and experimental results for LCU are also presented. Compared with a purely lexicon-based or purely corpus-based approach, LCU greatly improves Chinese word segmentation, especially in identifying n-char words. Two evaluation indexes are also proposed to describe the effectiveness of phrase extraction: segmentation rate (S) and segmentation consistency degree (D).
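For orientation, the lexicon-based baseline that the abstract contrasts against is commonly implemented as classic forward maximum matching. The abstract does not define FMFS, so the sketch below is not the paper's algorithm; it is a minimal illustration of the standard greedy technique, and the function name, the `max_len` parameter, and the toy lexicon are all assumptions made for the example.

```python
def forward_maximum_matching(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation (classic forward maximum matching).

    Illustrative only: this is the standard lexicon-based baseline,
    not the FMFS algorithm from the paper.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback,
            # which is how unknown (OOV) material gets fragmented.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, chosen to expose a typical greedy error.
lexicon = {"研究", "研究生", "生命", "起源"}
segments = forward_maximum_matching("研究生命的起源", lexicon)
print(segments)  # ['研究生', '命', '的', '起源']
```

The greedy match picks "研究生" (graduate student) and leaves "命" stranded, although "研究 / 生命" is the intended reading. Errors of this kind, along with words missing from the lexicon altogether, are the sort of weakness that combining lexicon and corpus evidence aims to reduce.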