Chinese Word Segmentation without Using Dictionary Based on Unsupervised Learning Strategy

孙茂松,肖明,邹嘉彦
DOI: https://doi.org/10.3321/j.issn:0254-4164.2004.06.003
2004-01-01
Jisuanji Xuebao/Chinese Journal of Computers
Abstract: Word segmentation that uses neither a dictionary nor an annotated corpus can improve the robustness of Chinese word segmentation systems in practical applications. This paper presents a new statistical measure, md, which estimates the associative strength between two adjacent Chinese characters as a linear combination of two common statistical measures, mutual information and the t-test. It introduces the concepts of peaks and valleys of md and gives a segmentation algorithm based on them. Large-scale open tests show that the proposed algorithm achieves 85.88% segmentation accuracy (defined as the ratio of the number of correctly segmented character locations to the total number of character locations in the test texts). The main advantage of the algorithm is that it adapts easily to new application domains, because all the statistical data it needs comes from character bigrams derived automatically from raw corpora.
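The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. The abstract does not give the exact form of md, its weights, or the peak/valley rules, so everything below is an assumption: the function names (`train`, `md_scores`, `segment`, `accuracy`), the weight `lam`, the crude add-one smoothing, the use of the standard collocation t-score as the t-test term, the optional `threshold`, and the simple rule "cut at every valley (strict local minimum) of md". The accuracy function reads "character locations" as the gaps between adjacent characters, which is also an interpretation.

```python
import math
from collections import Counter


def train(corpus):
    """Count character unigrams and bigrams in raw, unsegmented lines."""
    uni, bi = Counter(), Counter()
    for line in corpus:
        uni.update(line)
        bi.update(zip(line, line[1:]))
    return uni, bi


def md_scores(text, uni, bi, lam=0.5):
    """Score each gap between adjacent characters of `text`.

    Assumed form: md = lam * MI + (1 - lam) * t. The weight `lam` and the
    add-one smoothing are illustrative choices, not values from the paper.
    """
    n = sum(uni.values())
    scores = []
    for x, y in zip(text, text[1:]):
        pxy = (bi[(x, y)] + 1) / (n + 1)  # smoothed bigram probability
        px = (uni[x] + 1) / (n + 1)
        py = (uni[y] + 1) / (n + 1)
        mi = math.log2(pxy / (px * py))  # pointwise mutual information
        t = (pxy - px * py) / math.sqrt(pxy / n)  # collocation t-score
        scores.append(lam * mi + (1 - lam) * t)
    return scores


def segment(text, scores, threshold=math.inf):
    """Cut at valleys of the score sequence (assumed rule).

    A gap is a valley if its score is strictly lower than both neighbours;
    gaps at either end are compared against +infinity on the open side.
    """
    words, start = [], 0
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else math.inf
        right = scores[i + 1] if i + 1 < len(scores) else math.inf
        if s < left and s < right and s < threshold:
            words.append(text[start:i + 1])
            start = i + 1
    words.append(text[start:])
    return words


def boundaries(words):
    """Return the set of character offsets at which a word boundary falls."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts


def accuracy(pred, gold):
    """Fraction of inter-character gaps whose cut/no-cut decision is correct."""
    n = sum(len(w) for w in gold) - 1
    if n <= 0:
        return 1.0
    p, g = boundaries(pred), boundaries(gold)
    return sum((i in p) == (i in g) for i in range(1, n + 1)) / n
```

In use, `train` would run once over a raw corpus from the target domain, and `segment(text, md_scores(text, uni, bi))` would then split new text. Because only character bigram statistics are needed, moving to a new domain in this sketch means nothing more than recounting on new raw text, which mirrors the adaptability the abstract claims.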