Chinese Word Frequency Approximation Based on Multitype Corpora.
Wei Qiao, Maosong Sun, Wolfgang Menzel
DOI: https://doi.org/10.1080/09296171003643213
2010-01-01
Journal of Quantitative Linguistics
Abstract: Due to the nature of Chinese, a perfectly word-segmented Chinese corpus, which would be ideal for word frequency estimation, may never exist. Reliable estimation of Chinese word frequencies therefore remains a challenge. Currently, three types of corpora can be considered for this purpose: raw corpora, automatically word-segmented corpora, and manually word-segmented corpora. As each type has its own advantages and drawbacks, none of them is sufficient on its own. In this article, we propose a hybrid scheme which utilizes existing corpora of different types for word frequency approximation. Experiments have been performed from statistical and application-oriented perspectives. We demonstrate that, compared with other schemes, the proposed scheme is the most effective and leads to better word frequency approximation results.
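To make the contrast between corpus types concrete, the following minimal sketch compares a substring count over raw (unsegmented) text with a token count over a segmented version of the same text. It is an illustration only, not the hybrid scheme proposed in the article, which the abstract does not specify. The function names and the toy sentence are assumptions invented for this example.

```python
from collections import Counter


def raw_string_frequency(text, word):
    """Count occurrences of `word` as a substring of unsegmented text.

    Overlapping matches are counted; word boundaries are ignored, which is
    why this can overestimate the true word frequency.
    """
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def segmented_word_frequency(tokens):
    """Count tokens in a corpus that has already been split into words."""
    return Counter(tokens)


# Toy sentence, invented for illustration (not data from the article).
segmented = ["他", "是", "中国人", "，", "他", "爱", "中国", "。"]
raw = "".join(segmented)

# The raw count also matches "中国" inside the longer word "中国人".
raw_count = raw_string_frequency(raw, "中国")
seg_count = segmented_word_frequency(segmented)["中国"]

print(raw_count, seg_count)
```

In this toy case the raw count exceeds the segmented count because the string also occurs inside a longer word. A segmented corpus avoids that, but its counts depend on the quality of the segmentation.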