Statistical Properties of Overlapping Word Segmentation Ambiguities in Domain-specific Chinese Corpora

QIAO Wei, SUN Mao-song
DOI: https://doi.org/10.3969/j.issn.1003-0077.2008.04.002
2008-01-01
Abstract: Overlapping ambiguity is a major type of ambiguity in Chinese word segmentation. Existing word segmentation systems still resolve it unsatisfactorily, especially in domain-specific texts. Overlapping ambiguities in general-purpose corpora have been observed statistically in considerable detail, but similar observations for domain-specific corpora have not been reported in the literature. Using a medium-sized general-purpose Chinese wordlist, a general-purpose corpus of over 900 million characters, and two domain-specific corpora totaling 140 million characters and covering 55 domains, this paper studies the statistical properties of high-frequency overlapping ambiguities from several perspectives, with overlapping ambiguity strings from the general corpus examined in the domain corpora, and vice versa. The findings are expected to benefit word segmentation disambiguation, in particular for domain-specific texts.
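As an illustration of the term, an overlapping ambiguity string is commonly defined as a string ABC in which both AB and BC are words in the wordlist, so the shared part B can attach to either side. The sketch below is a minimal way to detect such strings in a sentence. It is not the procedure used in the paper; the function name `find_overlapping_ambiguities`, the `max_word_len` parameter, and the three-word toy wordlist are assumptions made for this example.

```python
def find_overlapping_ambiguities(text, wordlist, max_word_len=4):
    """Return sorted (start, end, string) spans covered by two overlapping wordlist words.

    Hypothetical helper for illustration only.
    """
    n = len(text)
    # Collect every occurrence of a multi-character wordlist word as a (start, end) span.
    spans = []
    for i in range(n):
        for j in range(i + 2, min(n, i + max_word_len) + 1):
            if text[i:j] in wordlist:
                spans.append((i, j))

    found = set()
    for s1, e1 in spans:
        for s2, e2 in spans:
            # The second word starts strictly inside the first and ends after it,
            # so the two words share at least one character.
            if s1 < s2 < e1 < e2:
                found.add((s1, e2, text[s1:e2]))
    return sorted(found)


# Toy wordlist and sentence, invented for this example.
toy_wordlist = {"从小", "小学", "学习"}
result = find_overlapping_ambiguities("他从小学习", toy_wordlist)
print(result)
```

With this toy input, "从小学" is reported because both "从小" and "小学" are in the wordlist, and "小学习" is reported because both "小学" and "学习" are. Counting how often each detected string occurs in a general corpus versus a domain corpus would be one simple way to obtain frequency statistics of the kind the abstract describes.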