Reduce Meaningless Words for Joint Chinese Word Segmentation and Part-of-speech Tagging

Kaixu Zhang, Maosong Sun
DOI: https://doi.org/10.48550/arXiv.1305.5918
2013-05-25
Abstract: Conventional statistics-based methods for joint Chinese word segmentation and part-of-speech tagging (S&T) have generalization ability to recognize new words that do not appear in the training data. An undesirable side effect is that a number of meaningless words will be incorrectly created. We propose an effective and efficient framework for S&T that introduces features to significantly reduce meaningless word generation. A general lexicon, Wikipedia and a large-scale raw corpus of 200 billion characters are used to generate word-based features for the wordhood. The word-lattice based framework consists of a character-based model and a word-based model in order to employ our word-based features. Experiments on Penn Chinese Treebank 5 show that this method has a 62.9% reduction of meaningless word generation in comparison with the baseline. As a result, the F1 measure for segmentation is increased to 0.984.
Computation and Language
What problem does this paper attempt to address?
This paper addresses a side effect of statistical methods for joint Chinese word segmentation and part-of-speech tagging (S&T): the same generalization ability that lets them recognize new words absent from the training data also leads them to create many meaningless words. When such a method meets a new or uncommon word, it may cut the text into strings that are not real words, which lowers segmentation and tagging accuracy. To address this, the authors propose an effective and efficient framework that introduces word-based features to significantly reduce meaningless word generation and thereby improve overall S&T performance. The features are derived from a general lexicon, Wikipedia and a large-scale raw corpus of 200 billion characters, and they are used to judge whether a string can be regarded as a word (that is, its wordhood). To make use of these features, the framework is built on a word lattice and combines a character-based model with a word-based model. Experiments on Penn Chinese Treebank 5 show that, compared with the baseline, the method reduces meaningless word generation by 62.9% and raises the F1 measure for segmentation to 0.984.
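To make the idea of word-based wordhood features on a word lattice more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes two hypothetical features (lexicon membership and a log-scale frequency bin from a raw corpus), hand-set weights, and a toy lexicon and count table invented for illustration. It also scores lattice edges with the word-based features only, whereas the paper combines a character-based model with a word-based model and also assigns part-of-speech tags.

```python
import math

# Toy resources, invented for illustration only.
TOY_LEXICON = {"我", "喜欢", "北京"}
TOY_COUNTS = {"我": 1000, "喜欢": 500, "北京": 800,
              "喜": 50, "欢": 40, "北": 60, "京": 30}


def wordhood_features(word, lexicon, corpus_counts):
    """Return simple (hypothetical) wordhood features for one candidate word."""
    count = corpus_counts.get(word, 0)
    return {
        "in_lexicon": word in lexicon,           # listed in the general lexicon?
        "freq_bin": int(math.log10(count + 1)),  # coarse log-scale corpus frequency
    }


def edge_score(features):
    """Score one lattice edge with hand-set weights (assumed, not learned)."""
    score = -1.0  # constant cost for emitting a word
    if features["in_lexicon"]:
        score += 2.0
    score += 0.5 * features["freq_bin"]
    return score


def best_segmentation(sentence, lexicon, corpus_counts, max_len=4):
    """Find the highest-scoring path through the word lattice by dynamic programming."""
    n = len(sentence)
    # best[i] = (best score of a segmentation of sentence[:i], start of its last word)
    best = [(float("-inf"), None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]  # one lattice edge
            features = wordhood_features(word, lexicon, corpus_counts)
            total = best[start][0] + edge_score(features)
            if total > best[end][0]:
                best[end] = (total, start)
    # Follow the back-pointers from the end of the sentence.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    print(best_segmentation("我喜欢北京", TOY_LEXICON, TOY_COUNTS))
```

On the toy data, strings such as 我喜 or 欢北 are in neither the lexicon nor the count table, so they receive only the per-word cost and lose to paths made of attested words. This is the intuition behind using wordhood evidence to suppress meaningless words; in a real system the weights would be learned from the treebank together with the character-based model.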