A Statistical Approach For Resolving Problematical Word Boundaries In Chinese Lexicography

Oi Yee Kwong, Benjamin K. Tsou
DOI: https://doi.org/10.1109/ICSMC.2001.972882
2001-01-01
Abstract: Word segmentation is an all-important topic in Chinese language processing. Although state-of-the-art segmentation algorithms demonstrate that more than 90% accuracy can be achieved, there remains a subtle question of what constitutes a Chinese word. In this paper, we focus on particular two-character word strings that often raise doubts, even for lexicographers, as to whether the two characters should be segmented or kept as one word. We examine the feasibility of modelling human judgement on such problematic word boundaries with corpus-based mutual information. Preliminary results show that the strength of correlation between the two measures might be lexically as well as structurally dependent, and that mutual information only partially models human judgement on problematic Chinese word boundaries.
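To make the mutual information measure concrete, here is a minimal sketch of pointwise mutual information for a two-character string, estimated from raw corpus counts. The abstract does not give the authors' exact formulation, so this uses the standard definition, log2 of P(xy) / (P(x) P(y)); the function name `bigram_pmi` and the three-sentence toy corpus are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def bigram_pmi(corpus, pair):
    """Pointwise mutual information (log base 2) of a two-character string.

    Assumption: probabilities are plain relative frequencies of characters
    and of adjacent character pairs, counted within each sentence.
    """
    chars = Counter()
    bigrams = Counter()
    for sentence in corpus:
        chars.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    if bigrams[pair] == 0:
        # The pair never occurs adjacently, so PMI is undefined (log of 0).
        return float("-inf")

    p_xy = bigrams[pair] / sum(bigrams.values())
    p_x = chars[pair[0]] / sum(chars.values())
    p_y = chars[pair[1]] / sum(chars.values())
    return math.log2(p_xy / (p_x * p_y))


# Toy corpus, invented for illustration only (not data from the paper).
toy_corpus = ["吃饭了", "我吃饭", "吃了饭"]
pmi_chifan = bigram_pmi(toy_corpus, "吃饭")
pmi_chile = bigram_pmi(toy_corpus, "吃了")
```

On this toy corpus, 吃饭 scores higher than 吃了, which is the kind of signal one would compare against a lexicographer's judgement on whether to keep the two characters together. Real estimates need a large corpus, since scores based on a handful of occurrences are unreliable.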