A Study on Natural Typing Annotations for Building Corpus of Chinese Word Segmentation

Dakui ZHANG, Dechun YIN, Shiping TANG, Yu MAO, Xiaozhong FAN
DOI: https://doi.org/10.3969/j.issn.1003-0077.2018.02.008
2018-01-01
Abstract: With the optimization of Chinese word segmentation algorithms, the performance of a word segmenter depends increasingly on the coverage and completeness of the training corpus. How to build a word segmentation corpus quickly, effectively, and automatically has therefore become a pressing issue. This paper explores the valuable natural word segmentation information that is produced when users type Chinese text. This information offers a new perspective on building Chinese word segmentation training corpora, one that has received little attention in the literature. We show that user-produced word segmentation information can be used to build a segmentation corpus, and that the resulting performance is acceptable. Moreover, some texts carrying this information from excellent users are very close to the gold-standard segmentation. In this study, we use a classification model and a voting mechanism to identify three such excellent users and obtain their texts with natural word segmentation information. Experimental results show that these texts can be used to build a segmentation training corpus that greatly improves the accuracy of the segmenter.