Research on Uyghur Morphological Segmentation Based on Long Sequence Labeling Method.

Ruohao Yan, Huaping Zhang, Wushour Silamu, Askar Hamdulla
DOI: https://doi.org/10.1145/3556384.3556425
2022-01-01
Abstract: With the steady progress of the "One Belt, One Road" national cooperation initiative, intelligent processing of the languages along the route has become increasingly important for communication. Uyghur is a representative agglutinative language: its words are built from stems and affixes, which makes the data sparse. Morphological segmentation separates Uyghur roots and affixes to alleviate this data sparseness. First, this paper studies the characteristics of the Uyghur morphological segmentation task and proposes a long sequence labeling method. Second, a BiLSTM network learns word-formation features, and a CRF model then learns label features. Finally, the paper proposes a new evaluation method. The paper reproduces related research and conducts experiments on the public THUUyMorph corpus, where the model reaches an F1 value of 98.60%. Experiments show that these results are better than those of current advanced Uyghur morphological segmentation models, and downstream Uyghur-Chinese translation experiments demonstrate the method's effectiveness. The scheme can be transferred to other languages along the route, such as Turkish, and offers a new research direction for morphological segmentation.
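To make the idea of treating segmentation as sequence labeling concrete, here is a minimal Python sketch. It is not the paper's long sequence labeling method or its new evaluation method, neither of which the abstract specifies. It assumes a generic character-level B/M/E/S tagging scheme (begin, middle, end, single) and a plain span-level F1. The function names and the Latin-script example word "kitablarni" (kitab + lar + ni) are illustrative assumptions, and in a full system the labels would be predicted by a tagger such as a BiLSTM-CRF rather than derived from a known segmentation.

```python
def to_labels(morphemes):
    """Map a list of morphemes to one B/M/E/S label per character."""
    labels = []
    for m in morphemes:
        if len(m) == 1:
            labels.append("S")  # single-character morpheme
        else:
            labels.extend(["B"] + ["M"] * (len(m) - 2) + ["E"])
    return labels


def from_labels(word, labels):
    """Rebuild the morpheme list from a word and its per-character labels."""
    morphemes, current = [], ""
    for ch, lab in zip(word, labels):
        current += ch
        if lab in ("E", "S"):  # these labels close a morpheme
            morphemes.append(current)
            current = ""
    if current:  # tolerate a label sequence that never closes the last morpheme
        morphemes.append(current)
    return morphemes


def span_f1(gold, pred):
    """Generic F1 over morpheme spans (start, end); an assumed metric."""
    def spans(ms):
        out, i = set(), 0
        for m in ms:
            out.add((i, i + len(m)))
            i += len(m)
        return out

    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


gold = ["kitab", "lar", "ni"]
labels = to_labels(gold)
print("".join(labels))                    # BMMMEBMEBE
print(from_labels("kitablarni", labels))  # ['kitab', 'lar', 'ni']
```

Under this scheme, a prediction that merges the two suffixes into "larni" gets only one of its two spans right, so `span_f1(gold, ["kitab", "larni"])` is penalized on both precision and recall.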