Algorithm for Solving 3-Character Crossing Ambiguities in Chinese Word Segmentation

SUN Maosong, ZUO Zhengping, HUANG Changning
DOI: https://doi.org/10.3321/j.issn:1000-0054.1999.05.027
1999-01-01
Abstract: Chinese word segmentation plays an important role in many applications of Chinese information processing. Crossing ambiguities with a length of 3 characters, one of the major types of segmentation ambiguity, occur frequently in running Chinese text. This paper proposes an algorithm aimed at this type of ambiguity: instead of using part-of-speech statistics, which are comparatively costly to train, the algorithm relies only on word frequency information and some common subsets of Chinese characters with defined properties. A preliminary experiment on 5367 examples, extracted from a Chinese corpus of 0.6 million characters, shows that the algorithm reaches a segmentation precision of 92.07%, which is satisfactory for practical Chinese information processing systems.
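To make the problem setting concrete, the following is a minimal sketch of what a 3-character crossing ambiguity looks like and how word frequency alone could be used to pick a segmentation. It is an illustration under stated assumptions, not the algorithm from the paper: the abstract does not give the decision rules or the character subsets, so the sketch uses only a plain frequency comparison. The function names, the lexicon, and the frequency counts are all hypothetical.

```python
from typing import Dict, Iterator, Set, Tuple


def find_crossing_ambiguities(text: str, lexicon: Set[str]) -> Iterator[Tuple[int, str]]:
    """Yield (start, ABC) for each 3-character span where both AB and BC are words."""
    for i in range(len(text) - 2):
        abc = text[i:i + 3]
        if abc[:2] in lexicon and abc[1:] in lexicon:
            yield i, abc


def resolve_by_frequency(abc: str, freq: Dict[str, int]) -> Tuple[str, str]:
    """Choose AB/C or A/BC by comparing the frequencies of AB and BC.

    Assumption: this simple comparison stands in for the paper's rules,
    which also use character subsets not described in the abstract.
    """
    ab, bc = abc[:2], abc[1:]
    if freq.get(ab, 0) >= freq.get(bc, 0):
        return ab, abc[2]
    return abc[0], bc


# Hypothetical lexicon and made-up counts, for illustration only.
lexicon = {"从小", "小学"}
freq = {"从小": 120, "小学": 300}

for start, span in find_crossing_ambiguities("他从小学起", lexicon):
    print(start, span, resolve_by_frequency(span, freq))
```

With these made-up counts the sketch prefers the split 从 / 小学 because 小学 is the more frequent word. A pure frequency comparison ignores context, which is presumably why the paper supplements it with character subsets.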