Chinese Chunking: Spanning Differences Between Corpora

Ying Qin, Xiaojie Wang, Yixin Zhong
2007-01-01
Abstract: Annotated corpora are valuable resources for natural language processing. However, corpora built by different organizations follow different annotation specifications, which hinders the sharing of corpus resources to some extent. This paper discusses the problems of transforming annotations between corpora and proposes an approach to annotation transformation and its evaluation. One application of annotation transformation is the Chinese chunking task. A chunker trained on data from the UPenn Chinese Treebank is used to identify chunks in People's Daily newswire, spanning the differences in word segmentation specifications and POS tag sets between the two corpora. With a corpus annotation transformation precision of about 86.97%, we succeed in identifying chunks in texts that follow the Peking University annotation specification.
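The idea of transforming one POS tag set into another and scoring the result can be illustrated with a minimal sketch. It is not the authors' method: the mapping table `PKU_TO_CTB`, the fallback tag, and the functions `transform_tags` and `transformation_precision` are hypothetical names introduced here, the mapping is a small illustrative subset, and precision is assumed to mean the fraction of transformed tags that match a gold annotation. Differences in word segmentation, which the paper also addresses, are not handled.

```python
# Hypothetical, partial mapping from Peking University style POS tags
# to Penn Chinese Treebank style POS tags (illustrative only).
PKU_TO_CTB = {
    "n": "NN",   # common noun
    "nr": "NR",  # person name -> proper noun
    "v": "VV",   # verb
    "a": "VA",   # adjective -> predicative adjective
    "d": "AD",   # adverb
    "p": "P",    # preposition
}


def transform_tags(tagged_words, mapping, default="NN"):
    """Rewrite each (word, tag) pair using the mapping.

    Tags missing from the mapping fall back to `default`
    (an assumption made for this sketch).
    """
    return [(word, mapping.get(tag, default)) for word, tag in tagged_words]


def transformation_precision(predicted, gold):
    """Fraction of transformed tags that equal the gold tags."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not predicted:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(predicted)


if __name__ == "__main__":
    sentence = [("研究", "v"), ("问题", "n")]
    converted = transform_tags(sentence, PKU_TO_CTB)
    tags = [tag for _, tag in converted]
    print(converted)
    print(transformation_precision(tags, ["VV", "NN"]))
```

A one-to-one lookup table is the simplest possible design. Real tag sets often need context-dependent rules, because a single tag in one specification can correspond to several tags in the other.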