Abstract: Bilingual lexicon extraction is useful, especially for low-resource languages that can draw on high-resource languages. The Uyghur language is a derivative language, and its language resources are scarce and noisy. Moreover, it is difficult to find bilingual resources that make the linguistic knowledge of high-resource languages, such as Chinese or English, available to it. There is little related research on unsupervised extraction for the Chinese-Uyghur language pair, and existing methods mainly focus on term extraction from translated parallel corpora. Unsupervised knowledge extraction methods are therefore particularly valuable for low-resource languages. This paper proposes a method to extract a Chinese-Uyghur bilingual dictionary using the inter-word relationship matrix obtained by mapping neural cross-lingual word embeddings. A seed dictionary is used as a weak supervision signal. A small Chinese-Uyghur parallel data resource is used to map the word vectors of both languages into a unified vector space. Because word units in the two languages do not align well, stems are used as the main linguistic units. The strong inter-word semantic relationships captured by the word vectors are used to associate Chinese and Uyghur semantic information. Two retrieval metrics, nearest neighbor retrieval and cross-domain similarity local scaling (CSLS), are used to calculate similarity and extract the bilingual dictionary. The experimental results show that the accuracy of the proposed Chinese-Uyghur bilingual dictionary extraction method reaches 65.06%. This method helps to improve Chinese-Uyghur machine translation, automatic knowledge extraction, and multilingual translation.
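
The mapping and retrieval steps described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: the mapping is taken to be an orthogonal (Procrustes) transform learned from seed-dictionary pairs, embeddings are plain NumPy arrays with one row per stem, and the function names (`procrustes`, `nn_retrieve`, `csls_retrieve`) and the neighborhood size `k` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def normalize(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def procrustes(src_seed, tgt_seed):
    """Orthogonal W minimizing ||src_seed @ W - tgt_seed||_F (assumed mapping).

    Rows of src_seed and tgt_seed are embeddings of seed-dictionary pairs.
    """
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt


def nn_retrieve(src, tgt):
    """Nearest-neighbor retrieval: best target index for each source row."""
    sims = normalize(src) @ normalize(tgt).T
    return sims.argmax(axis=1)


def csls_retrieve(src, tgt, k=10):
    """Cross-domain similarity local scaling (CSLS) retrieval.

    Each cosine score is penalized by the mean similarity of both words
    to their k nearest cross-lingual neighbors, which discounts "hub"
    words that are close to everything.
    """
    sims = normalize(src) @ normalize(tgt).T
    k = min(k, sims.shape[0], sims.shape[1])
    # Mean similarity of every source row to its k nearest targets.
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of every target row to its k nearest sources.
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)
    csls = 2 * sims - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1)
```

In use, one would learn `W` from the seed pairs only, apply `src @ W` to the full source vocabulary, and then take the top CSLS match for each source stem as its candidate translation. CSLS usually differs from plain nearest-neighbor retrieval only when some target words behave as hubs.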

Chinese-Uyghur Bilingual Lexicon Extraction Based on Weak Supervision
