Annotation Scheme for Chinese Treebank

周强
DOI: https://doi.org/10.3969/j.issn.1003-0077.2004.04.001
2004-01-01
Abstract:The syntactically annotated corpora, commonly called ‘treebanks’, play an important role in empirical linguistics as well as in machine learning methods in natural language processing. After a brief summarization of several treebank annotation of different language, we proposed a new annotation scheme for Chinese treebank in this paper. Under this scheme, every Chinese sentence will be annotated with a complete parse tree, where each non terminal constituent is assigned with two tags. One is the syntactic constituent tag, which describes its external functional relation with other constituents in the parse tree. The other is the grammatical relation tag, which describes the internal structural relation of its sub components. These two tag sets consist of 16 and 27 tags respectively. They form an integrated annotation for the syntactic constituent in a parse tree through top down and bottom up descriptions. Based on this scheme, we built a 1,000,000 words Chinese treebank covering a balanced collection of journalistic, literary, academic, and other documents. The annotating experiments on different kinds of complex linguistic phenomena show the availability and compatibility of this annotation scheme.
What problem does this paper attempt to address?