Abstract: Owing to its generality and its time efficiency on large data sets, text filtering through pattern matching has attracted increasing attention in recent years from the information retrieval and natural language processing (NLP) research communities. It remains unclear, however, how this technique and its algorithms (e.g., Wu–Manber, which is also considered in this paper) can be applied properly and effectively to Uyghur, a low-resource language spoken mostly by the Uyghur ethnic group, with a population of more than eleven million in Xinjiang, China. We observe that, technically, the challenge is caused mainly by two factors: (1) vowel weakening and (2) semantic mismatch between affixes and stems. Accordingly, in this paper we propose Wu–Manber–Uy, an improved variant of Wu–Manber designed specifically for the Uyghur language. Wu–Manber–Uy implements a pattern expansion strategy based on stem deformation, which reduces the pattern mismatches caused by vowel weakening and spelling errors. Wu–Manber–Uy also uses a two-way strategy that monitors and controls changes in the lexical meaning of stems during word formation. In addition, Word2vec and a dictionary are incorporated into the system for processing Uyghur. The experimental results we have obtained consistently demonstrate the high performance of Wu–Manber–Uy.
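To make the idea of stem deformation-based pattern expansion concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes Latin-script transliteration and a deliberately simplified weakening rule (a stem-final-syllable a or e may surface as i or é before a suffix), and it pairs the expansion with a compact Wu–Manber-style shift table over two-character blocks. The names `expand_stem`, `build_tables`, and `search` are hypothetical, as is the example sentence.

```python
VOWELS = set("aeiouöüé")
# Assumed, simplified weakening rule (illustration only): a/e -> i or é.
WEAKENING = {"a": ("i", "é"), "e": ("i", "é")}

def expand_stem(stem):
    """Return the stem plus variants whose last vowel (if a/e) is weakened."""
    variants = {stem}
    for pos in range(len(stem) - 1, -1, -1):
        if stem[pos] in VOWELS:
            for weak in WEAKENING.get(stem[pos], ()):
                variants.add(stem[:pos] + weak + stem[pos + 1:])
            break  # only the last vowel of the stem is considered
    return variants

def build_tables(patterns, block=2):
    """Build Wu-Manber-style SHIFT and bucket tables (patterns need len >= block)."""
    m = min(len(p) for p in patterns)      # window = shortest pattern length
    default = m - block + 1                # shift for blocks seen in no pattern
    shift, buckets = {}, {}
    for p in patterns:
        for j in range(block, m + 1):      # j = end of the block inside p[:m]
            blk = p[j - block:j]
            shift[blk] = min(shift.get(blk, default), m - j)
        buckets.setdefault(p[m - block:m], []).append(p)
    return m, default, shift, buckets

def search(text, patterns, block=2):
    """Return (start, pattern) for every occurrence of any pattern in text."""
    m, default, shift, buckets = build_tables(patterns, block)
    hits = []
    i = m                                  # exclusive end of the current window
    while i <= len(text):
        blk = text[i - block:i]
        step = shift.get(blk, default)
        if step == 0:                      # candidate: verify patterns in bucket
            start = i - m
            for p in buckets.get(blk, ()):
                if text.startswith(p, start):
                    hits.append((start, p))
            step = 1
        i += step
    return hits

text = "balilar mektepke bardi"            # hypothetical example sentence
plain_hits = search(text, {"bala"})        # unexpanded stem
expanded_hits = search(text, expand_stem("bala"))
```

In this sketch the unexpanded stem `bala` does not match the inflected form `balilar`, whereas the expanded pattern set does, which is the kind of mismatch the expansion strategy is meant to reduce. A real system would need script-aware rules and the semantic checks described above, which are not modeled here.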

Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur
