Recent Developments in Chinese Corpus Research
Zhan Weidong, Chang Baobao, Duan Huiming, Zhang Huarui
2006-01-01
Abstract: This paper first gives a brief overview of the history of Chinese corpus development in mainland China, focusing on representative research projects of the last decade, such as the General Contemporary Chinese Corpus, sponsored by the State Language Commission of the Ministry of Education of China, and the Chinese Corpus of Situated Discourse in the Beijing Area, built by the Chinese Academy of Social Sciences. It then describes related work at Peking University on the design, annotation, and use of corpora. Four resources are discussed in detail: (1) a very large-scale Chinese corpus covering a wide time span, intended for linguistic research and served through an online KWIC concordance based on the Web-Lucene search engine; (2) the People's Daily corpus, annotated with word segmentation and part-of-speech tags; (3) a Chinese Treebank, from which rules for constructing Chinese phrases can be extracted automatically and the distribution of phrase types can be described statistically; and (4) a Chinese-English parallel corpus, on which a prototype workbench has been built to support Chinese-English lexicography. The final part of the paper briefly discusses issues that have recently received increasing attention in the field, including the standardization of Chinese corpus encoding and approaches to sharing large-scale Chinese corpora for research and public use.