Automatic Collecting of Text Data for Cantonese Language Modeling

Jiang CAO, Xiaojun WU, Yu Ting Yeung, Tan LEE, Thomas Fang Zheng
2008-01-01
Abstract: For many minority languages, it is hard to collect corpora large enough to train good language models. Cantonese, one of the most widely spoken Chinese dialects, is one such language: the lack of language material for language model training is a major obstacle to Cantonese language processing. Unlike many other languages, Cantonese differs greatly between its written and colloquial forms. Moreover, people in Hong Kong mix Cantonese and English when they talk, which is another special characteristic of the language. In addition, materials collected from different sources contain different proportions of colloquial Cantonese sentences, so the sources should not be treated equally. We developed a filter model built at the lexical and grammatical levels. We trained this model on a development set and achieved a precision of 99.89% and a recall of 88.2% on the test set. Using this model, we devised a method for assigning a credibility to each material source. The method is iterative, and its result determines the proportion of sentences chosen from each source for language model training.
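The idea of filtering sentences and weighting sources by how colloquial they are can be illustrated with a minimal Python sketch. It is not the authors' filter model: the marker set `COLLOQUIAL_MARKERS`, the threshold `min_markers`, and all function names are hypothetical, the filter here is purely lexical (the grammatical level is omitted), and the credibility is computed in a single pass rather than by the iterative process the abstract mentions.

```python
# Hypothetical marker set: a few characters common in colloquial Cantonese writing.
# The lexical and grammatical features of the actual filter are not given in the abstract.
COLLOQUIAL_MARKERS = set("嘅咗喺唔冇佢哋嘢")


def is_colloquial(sentence, min_markers=1):
    """Toy lexical filter: accept a sentence with at least min_markers marker characters."""
    return sum(ch in COLLOQUIAL_MARKERS for ch in sentence) >= min_markers


def source_credibility(sentences_by_source):
    """Assumed definition: fraction of a source's sentences accepted by the filter."""
    return {
        source: (sum(is_colloquial(s) for s in sents) / len(sents)) if sents else 0.0
        for source, sents in sentences_by_source.items()
    }


def sampling_proportions(credibility):
    """Normalize credibilities so they sum to 1 and can serve as sampling weights."""
    total = sum(credibility.values())
    if total == 0:
        return {source: 0.0 for source in credibility}
    return {source: c / total for source, c in credibility.items()}


def precision_recall(predicted, gold):
    """Precision and recall of boolean filter decisions against boolean gold labels."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example data, for illustration only.
corpus = {
    "forum": ["我哋今日去邊度食嘢", "佢唔喺屋企", "今天天氣很好"],
    "news": ["政府宣布新的政策", "他們已經離開了", "呢個係我嘅"],
}
credibility = source_credibility(corpus)
proportions = sampling_proportions(credibility)
```

In this toy setup the forum source, which has more sentences passing the filter, receives a larger share of the training sentences than the news source. A high-precision filter suits this use, since wrongly accepted sentences would inflate a source's apparent credibility.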