Abstract: Bilingual lexicon extraction is useful, especially for low-resource languages that can draw on high-resource languages. Uyghur is a morphologically derivational language whose language resources are scarce and noisy. Moreover, bilingual resources that would let it draw on the linguistic knowledge of high-resource languages such as Chinese or English are hard to find. There is little research on unsupervised lexicon extraction for the Chinese-Uyghur language pair, and existing methods mainly perform term extraction from translated parallel corpora. Unsupervised knowledge extraction methods are therefore particularly valuable for low-resource languages. This paper proposes a method that extracts a Chinese-Uyghur bilingual dictionary by combining an inter-word relationship matrix with cross-lingual word embeddings learned by a neural network. A seed dictionary is used as a weak supervision signal, and a small Chinese-Uyghur parallel data resource is used to map the word vectors of both languages into a unified vector space. Because the word-level units of the two languages do not align well, stems are used as the main linguistic units. The strong inter-word semantic relationships captured by the word vectors are used to associate Chinese and Uyghur semantic information. Two retrieval criteria, nearest neighbor retrieval and cross-domain similarity local scaling (CSLS), are used to compute similarity and extract the bilingual dictionary. Experimental results show that the proposed Chinese-Uyghur bilingual dictionary extraction method reaches an accuracy of 65.06%. The method can support Chinese-Uyghur machine translation, automatic knowledge extraction, and multilingual translation.
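
The two retrieval criteria named in the abstract can be illustrated with a minimal sketch. It assumes the Chinese and Uyghur stem vectors have already been mapped into the shared space, and it uses the commonly cited CSLS formulation, 2·cos(x, y) minus the mean similarity of x to its k nearest target neighbors minus the mean similarity of y to its k nearest source neighbors. The function names, the value of k, and the toy vectors are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(src, tgt):
    """sim[i][j] = cosine similarity between source vector i and target vector j."""
    return [[cosine(s, t) for t in tgt] for s in src]


def mean_top_k(values, k):
    """Mean of the k largest values (the neighborhood density used by CSLS)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def nn_retrieve(sim):
    """Nearest neighbor retrieval: most similar target index for each source row."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def csls_retrieve(sim, k=2):
    """CSLS retrieval: penalize targets that are close to many source words (hubs)."""
    r_src = [mean_top_k(row, k) for row in sim]        # density around each source word
    r_tgt = [mean_top_k(col, k) for col in zip(*sim)]  # density around each target word
    best = []
    for i, row in enumerate(sim):
        scores = [2 * row[j] - r_src[i] - r_tgt[j] for j in range(len(row))]
        best.append(max(range(len(scores)), key=scores.__getitem__))
    return best


if __name__ == "__main__":
    # Made-up 2-d vectors standing in for mapped Chinese (source) and Uyghur (target) stems.
    zh_vectors = [[1.0, 0.1], [0.8, 0.6]]
    ug_vectors = [[0.9, 0.3], [0.2, 1.0]]
    sim = similarity_matrix(zh_vectors, ug_vectors)
    print("NN  :", nn_retrieve(sim))
    print("CSLS:", csls_retrieve(sim, k=1))
```

Nearest neighbor retrieval can assign the same target stem to several source stems when that target is a hub. CSLS subtracts the local neighborhood density on both sides, which tends to reduce this effect.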

Toward Better Loanword Identification in Uyghur Using Cross-lingual Word Embeddings

Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion

Chinese-Uyghur Bilingual Lexicon Extraction Based on Weak Supervision

Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations

Cross-Language Sensitive Words Distribution Map: A Novel Recognition-Based Document Understanding Method for Uighur and Tibetan

Learning Distributed Representations Of Uyghur Words And Morphemes

Uyghur Morphological Segmentation with Bidirectional GRU Neural Networks

Uyghur-Chinese statistical machine translation by incorporating morphological information

Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment

Improving Uyghur ASR systems with decoders using morpheme-based language models

Enhance word representation for out-of-vocabulary on Ubuntu dialogue corpus

English-Chinese Bi-Directional OOV Translation based on Web Mining and Supervised Learning

Neural Cross-Lingual Named Entity Recognition with Minimal Resources

Word Level Script Recognition for Uighur Document Mixed with English Script

Finding Better Subword Segmentation for Neural Machine Translation

Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur

A Multilingual Language Processing Tool for Uyghur, Kazak and Kirghiz

A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning

Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context

Learning Chinese Word Embeddings from Stroke, Structure and Pinyin of Characters