Word frequency approximation for Chinese without using manually-annotated corpus

Maosong Sun, Zhengcao Zhang, Benjamin Ka-Yin T’sou, Huaming Lu
DOI: https://doi.org/10.1007/11671299_13
2006-01-01
Abstract: Word frequencies play important roles in a variety of NLP-related applications. Word frequency estimation for Chinese is a major challenge due to characteristics of the language, in particular word formation and word segmentation. This paper addresses word frequency estimation under the condition that only a Chinese wordlist and a raw Chinese corpus of arbitrarily large size are available, and no manual annotation is performed on the corpus. Several realistic schemes for approximating word frequencies are presented under the frameworks of STR (the frequency of a string of characters as an approximation of word frequency) and MM (maximal matching). Large-scale experiments indicate that the proposed scheme, MinMaxMM, can significantly improve the estimation of word frequencies, though its performance is still not very satisfactory in some cases.
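To make the two frameworks named in the abstract concrete, the following is a minimal sketch of generic STR counting and of forward and backward maximal matching. It is not the paper's implementation: the function names (`str_frequency`, `forward_mm`, `backward_mm`, `mm_frequency`), the fallback to single characters for unlisted text, and the toy wordlist are all assumptions made for illustration. MinMaxMM is not sketched because the abstract does not define it.

```python
from collections import Counter


def str_frequency(corpus, wordlist):
    """STR: count every occurrence of each word as a raw character string."""
    counts = Counter()
    for word in wordlist:
        start = corpus.find(word)
        while start != -1:
            counts[word] += 1
            # Advance by one character so overlapping occurrences are counted.
            start = corpus.find(word, start + 1)
    return counts


def forward_mm(text, wordlist):
    """Forward maximal matching: scan left to right, taking the longest listed word."""
    max_len = max(map(len, wordlist))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumption: an unlisted single character becomes its own token.
            if size == 1 or piece in wordlist:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_mm(text, wordlist):
    """Backward maximal matching: scan right to left, taking the longest listed word."""
    max_len = max(map(len, wordlist))
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in wordlist:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def mm_frequency(corpus, wordlist, segment=forward_mm):
    """MM: count only the listed words that survive segmentation."""
    return Counter(t for t in segment(corpus, wordlist) if t in wordlist)


# Toy data, chosen for illustration only (not taken from the paper).
wordlist = {"研究", "研究生", "生命", "命", "起源"}
corpus = "研究生命起源"

print(str_frequency(corpus, wordlist))
print(forward_mm(corpus, wordlist))
print(backward_mm(corpus, wordlist))
```

On this toy string, STR counts every listed word once, including both 研究 and 研究生, whereas forward and backward matching commit to different segmentations. This is the kind of disagreement that makes segmentation-based frequency estimates unreliable without annotation.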