Abstract: This paper first gives a brief overview of the history of Chinese corpus development in mainland China, focusing on representative research projects of the last decade, such as the General Contemporary Chinese Corpus, sponsored by the State Language Commission under China's Ministry of Education, and the Chinese Corpus of Situated Discourse in the Beijing Area, built by the Chinese Academy of Social Sciences. It then elaborates on related work at Peking University on the design, annotation, and use of corpora. Four resources are discussed in detail: (1) a very large-scale Chinese corpus with a wide time span, intended for linguistic research, with an online KWIC concordance based on a Web-Lucene search engine; (2) the People's Daily corpus, processed with word segmentation and part-of-speech tagging; (3) a Chinese Treebank, from which Chinese phrase-construction rules can be extracted automatically and the distribution of all kinds of phrases can be described statistically; and (4) a Chinese-English parallel corpus, on which a workbench prototype has been built to support Chinese-English lexicography. The final part of the paper briefly discusses some issues that have recently received more attention in this field, including the standardization of Chinese corpus encoding and approaches to sharing large-scale Chinese corpora for research and public use.

Method of new Chinese words identification from large scale network corpora

New Word Identification in Social Network Text Based on Time Series Information

Internet-oriented Chinese New Words Detection

Research on Intelligent Construction of China English Network New Words Database Based on Adjacent Entropy Recognition Algorithm

New Word Detection Using BiLSTM+CRF Model with Features

Automatic Chinese name recognition based on web corpus analysis

Research on algorithm for networks new words identification

Automatically Building Large-Scale Named Entity Recognition Corpora from Chinese Wikipedia

Domain-Specific New Words Detection in Chinese

Detecting new Chinese words from massive domain texts with word embedding

A study on the classification of stylistic and formal features in English based on corpus data testing

New Words Recognition Algorithm and Application Based on Micro-Blog Hot

New Word Detection For Sentiment Analysis

On Construction of a Chinese Corpus Based on Semantic Dependency Relations

Large-scale Automatic Extraction of Chinese Compound Lexical Cohesion Pairs

Extract Chinese Unknown Words from a Large-scale Corpus Using Morphological and Distributional Evidences

Quality Assurance Of Automatic Annotation Of Very Large Corpora: A Study Based On Heterogeneous Tagging Systems

LSICC: A Large Scale Informal Chinese Corpus

Resolving error accumulation of automatically acquiring bilingual lexical knowledge by semantic similarity

Deep Learning for Chinese Word Segmentation and POS Tagging

Recent Developments in Chinese Corpus Research