Chinese Word Frequency Approximation Based on Multitype Corpora.

Wei Qiao, Maosong Sun, Wolfgang Menzel
DOI: https://doi.org/10.1080/09296171003643213
2010-01-01
Journal of Quantitative Linguistics
Abstract: Due to the nature of Chinese, a perfect word-segmented Chinese corpus that is ideal for the task of word frequency estimation may never exist. Therefore, a reliable estimation for Chinese word frequencies remains a challenge. Currently, three types of corpora can be considered for this purpose: raw corpora, automatically word-segmented corpora, and manually word-segmented corpora. As each type has its own advantages and drawbacks, none of them is sufficient alone. In this article, we propose a hybrid scheme which utilizes existing corpora of different types for word frequency approximation. Experiments have been performed from statistical and application-oriented perspectives. We demonstrate that, compared with other schemes, the proposed scheme is the most effective one and leads to better word frequency approximation results.