Mongolian Word Segmentation Based On Three Character Level Seq2seq Models

Na Liu,Xiangdong Su,Guanglai Gao,Feilong Bao
DOI: https://doi.org/10.1007/978-3-030-04221-9_50
2018-01-01
Abstract:Mongolian word segmentation is splitting the Mongolian words into roots and suffixes. It plays an important role in Mongolian related natural language processing tasks. To improve performance and avoid the tedious work of rule-making and statistics over large-scale corpus in early methods, this work takes a Seq2Seq framework to realize Mongolian word segmentation. Since each Mongolian word consisted of several sequential characters, we map Mongolian word segmentation to character-level Seq2Seq task, and further propose three different models from three different prospective to achieve the segmentation goal. The three character-level Seq2Seq models are (1) translation model, (2) true and pseudo mapping model, (3) binary choice model. The main differences of these three models are the output sequences and the architectures of the RNNs in segmentation. We employ an improved beam search to optimize the second segmentation model and boost the segmentation process. All the models are trained on a limited dataset, and the second model achieved the state-of-the-art accuracy.
What problem does this paper attempt to address?