Character-Based N-gram Model for Uyghur Text Retrieval.

Turdi Tohti, Lirui Xu, Jimmy Huang, Winira Musajan, Askar Hamdulla
DOI: https://doi.org/10.1007/978-3-319-97909-0_72
2018-01-01
Abstract: Uyghur is a low-resource language, yet Uyghur information retrieval (IR) is becoming increasingly important. Although related research and stem-based Uyghur IR systems exist, high retrieval performance remains difficult to achieve because of the limitations of existing stemming methods. In this paper, we propose a character-based N-gram model and a corresponding smoothing algorithm for Uyghur IR. A full-text IR system based on the character N-gram model is developed using the open-source tool Lucene. A series of experiments and comparative analyses is conducted. Experimental results show that the proposed method performs better than conventional Uyghur IR systems.
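To make the idea of character-based N-gram indexing concrete, the following is a minimal sketch of how a text might be split into overlapping character N-grams in place of stems. It is an illustration only, not the authors' implementation: the function name `char_ngrams`, the whitespace tokenization, the handling of short tokens, and the Latin-script sample words are all assumptions, and the smoothing algorithm mentioned in the abstract is not modeled here.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return overlapping character n-grams for each whitespace-separated token.

    Hypothetical helper for illustration; the paper's actual segmentation
    and smoothing are not described in the abstract.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    grams: list[str] = []
    for token in text.split():
        if len(token) <= n:
            # Assumption: tokens no longer than n are kept whole.
            grams.append(token)
        else:
            # Slide a window of width n across the token, one character at a time.
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


if __name__ == "__main__":
    # Latin-script sample words, used only to show the windowing.
    print(char_ngrams("kitab oqu", 2))
```

Because the index terms are fixed-width character windows rather than stems, a query and a document can match on shared N-grams even when a stemmer would segment an inflected word incorrectly.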