Unsupervised Learning and Linguistic Rule Based Algorithm for Uyghur Word Segmentation.

Turdi Tohti, Winira Musajan, Askar Hamdulla
DOI: https://doi.org/10.4304/jmm.9.5.627-634
2014-01-01
Journal of Multimedia
Abstract: Traditional word segmentation based on inter-word spaces is not well suited to semantic words made up of several words, because it splits such a word into fragments that are inconsistent with its original meaning. This becomes a bottleneck in Uyghur text analysis and text understanding applications. This paper puts forward a new idea and related algorithms for segmenting Uyghur multi-word semantic words. In the algorithm, word-based bigrams and contextual information are derived automatically from a large-scale raw text corpus and, following the association rules between Uyghur words, a linear combination of mutual information, difference of t-test, and dual adjacent entropy is taken as a new measure (dmd) to estimate the agglutinative strength between two adjacent Uyghur words. Experimental results on large-scale open tests show that the proposed algorithm achieves 88.21% segmentation accuracy.
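The abstract does not give the formulas behind dmd, so the following is only a minimal sketch of how such a combined score could be computed, under stated assumptions. It uses the standard pointwise mutual information, takes "dual adjacent entropy" to mean the smaller of the entropies of the words seen to the left and to the right of a word pair, and accepts the difference of t-test as a precomputed value rather than deriving it. The function names, the `min` choice, and the equal weights are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def collect_counts(sentences):
    """Count unigrams, bigrams, and the left/right neighbours of each bigram."""
    unigrams, bigrams = Counter(), Counter()
    left_ctx, right_ctx = defaultdict(Counter), defaultdict(Counter)
    for words in sentences:
        unigrams.update(words)
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left_ctx[pair][words[i - 1]] += 1
            if i + 2 < len(words):
                right_ctx[pair][words[i + 2]] += 1
    return unigrams, bigrams, left_ctx, right_ctx


def mutual_information(pair, unigrams, bigrams):
    """Pointwise mutual information (base 2) of two adjacent words."""
    if bigrams[pair] == 0:
        return float("-inf")
    p_xy = bigrams[pair] / sum(bigrams.values())
    p_x = unigrams[pair[0]] / sum(unigrams.values())
    p_y = unigrams[pair[1]] / sum(unigrams.values())
    return math.log2(p_xy / (p_x * p_y))


def entropy(counter):
    """Shannon entropy (base 2) of a count distribution; 0.0 if empty."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def adjacent_entropy(pair, left_ctx, right_ctx):
    """Assumption: take the smaller of the left and right context entropies."""
    return min(entropy(left_ctx[pair]), entropy(right_ctx[pair]))


def association_score(pair, counts, t_diff=0.0, w_mi=1.0, w_t=1.0, w_ae=1.0):
    """Weighted sum of the three statistics.

    t_diff is a caller-supplied difference of t-test (not computed here);
    the weights are illustrative placeholders, not values from the paper.
    """
    unigrams, bigrams, left_ctx, right_ctx = counts
    mi = mutual_information(pair, unigrams, bigrams)
    ae = adjacent_entropy(pair, left_ctx, right_ctx)
    return w_mi * mi + w_t * t_diff + w_ae * ae
```

A segmenter built on this idea would keep two adjacent words together as one semantic unit when their score exceeds some threshold, and split them otherwise; how that threshold is chosen is not stated in the abstract.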