Chinese Word Extraction Based on the Internal Associative Strength of Character Strings

罗盛芬,孙茂松
DOI: https://doi.org/10.3969/j.issn.1003-0077.2003.03.002
2003-01-01
Abstract: Word extraction is an important task in text information processing. A conventional approach estimates how likely a candidate character string is to be a word from the internal associative strength among its characters. In this paper, the authors first test nine widely adopted statistical measures of this kind individually on Chinese word extraction, then investigate whether properly combining these measures improves performance. A genetic algorithm is used to adjust the combination weights automatically. Experiments on two-character Chinese word extraction show that mutual information is the most powerful of these measures, achieving an F-measure of 54.77%, and that combination brings no significant gain, reaching an F-measure of only 55.47%. This suggests that the measures do not complement each other well, and that the simplest effective approach to Chinese word extraction is to use mutual information directly.
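To make the idea of scoring a two-character candidate by mutual information concrete, here is a minimal sketch. It assumes the standard pointwise mutual information, log2(P(xy) / (P(x) P(y))), with probabilities estimated from raw counts. The abstract does not give the paper's exact formula, corpus, or cut-off, so the function names `bigram_pmi` and `extract_words` and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bigram_pmi(corpus):
    """Return {two-character string: PMI in bits} for all adjacent pairs.

    `corpus` is an iterable of unsegmented strings (assumed input format).
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        # Every adjacent character pair is a candidate two-character word.
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    scores = {}
    for pair, count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = char_counts[pair[0]] / n_chars
        p_y = char_counts[pair[1]] / n_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract_words(corpus, threshold):
    """Keep candidates whose PMI reaches `threshold` (an assumed cut-off)."""
    scores = bigram_pmi(corpus)
    kept = [pair for pair, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: -scores[pair])
```

On a toy corpus such as `["我们学习", "我们工作", "学习工作"]`, the pairs that recur as units (我们, 学习, 工作) score higher than pairs that merely straddle a boundary (们学, 们工, 习工), which is the behaviour the measure relies on. A real system would need a large corpus and a tuned threshold.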