Experimental Study of Hidden Markov Model Based Part-of-speech Tagging for Chinese Texts

SUN Maosong, LU Hongna, TSOU Benjamin K
DOI: https://doi.org/10.3321/j.issn:1000-0054.2000.09.015
2000-01-01
Abstract: Part-of-speech tagging plays an important role in many applications of Chinese information processing. Using a large-scale manually annotated Chinese corpus and a number of carefully conducted experiments, this study examines the hidden Markov model based part-of-speech tagging scheme for Chinese texts. The results are: ① The Bigram model is better than the Trigram model in terms of the performance-to-cost ratio. ② An annotated corpus of about 70,000 word tokens is sufficient for training the Bigram model, yielding about 93% tagging accuracy for ambiguous word tokens and 97% tagging accuracy for all word tokens in the texts. ③ The Bigram model adapts quite well to different application domains. These conclusions will facilitate the development of practical part-of-speech tagging systems for Chinese texts.
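As a rough illustration of the kind of Bigram hidden Markov model tagger the abstract evaluates, the sketch below collects tag-transition and word-emission counts from a tiny hand-made corpus and decodes with the Viterbi algorithm. It is not the authors' implementation: the function names, the add-one smoothing, the `<s>` start symbol, the three-sentence corpus, and the tags `r`, `v`, `n` are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def train_bigram_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start symbol
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counts, context, event, n_outcomes):
    """Add-one smoothed log P(event | context); smoothing is an assumption."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + n_outcomes))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence under the Bigram model."""
    tags = sorted(emit)
    n_tags = len(tags)
    # Vocabulary size plus one slot reserved for unseen words.
    n_words = len({w for t in emit for w in emit[t]}) + 1
    # best[tag] = (log score, tag path) of the best path ending in tag.
    best = {"<s>": (0.0, [])}
    for word in words:
        new_best = {}
        for tag in tags:
            e = log_prob(emit, tag, word, n_words)
            score, path = max(
                (s + log_prob(trans, prev, tag, n_tags) + e, p)
                for prev, (s, p) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values())[1]


# Toy corpus (invented): r = pronoun, v = verb, n = noun.
toy_corpus = [
    [("我", "r"), ("爱", "v"), ("书", "n")],
    [("他", "r"), ("读", "v"), ("书", "n")],
    [("我", "r"), ("读", "v"), ("报", "n")],
]
trans, emit = train_bigram_hmm(toy_corpus)
print(viterbi(["他", "爱", "报"], trans, emit))  # ['r', 'v', 'n']
```

A Trigram variant would condition each transition on the two preceding tags, which enlarges the transition table and the decoding state space; this is the cost side of the trade-off named in result ①.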