From Synchronous Corpus to Monitoring Corpus, LIVAC: The Chinese Case

Benjamin K. Tsou, Andy C. Chin, Oi Yee Kwong
2011-01-01
Abstract: Very large corpora of properly processed textual materials are uncommon, but they can provide important resources for language modeling in natural language processing, ranging from speech processing and text input to automatic IR and patent translation. Moreover, when properly cultivated in spatial-temporal terms, they can foster innovative knowledge discovery in database applications by functioning as a monitoring corpus, and they can enhance the human-centered communication environment by allowing more substantive introspection and comparison of linguistic and social-cultural developments of the relevant speech communities. This paper discusses how the gigantic synchronous and homothematic corpus of Chinese, LIVAC, can contribute to monitoring linguistic homogeneity and heterogeneity both diachronically and synchronically. After processing media texts of more than 400 million Chinese characters over 16 years, LIVAC has yielded a lexical corpus of 1.5 million words. This paper examines some aspects of the nature and extent of lexical and morphological divergence and convergence in the Chinese language of Hong Kong, Taipei and Beijing. Additional discussions cover the creation and relexification of neologisms, categorial fluidity, and the associated challenges to terminology standardization, such as renditions of non-Chinese personal names. This paper also explores how the associated socio-cultural developments can be fruitfully monitored by means of this unique spatial-temporal corpus.