Abstract: Neural machine translation (NMT), like other deep learning methods, relies on large amounts of training data (bilingual parallel sentences, in the case of NMT) to reach state-of-the-art performance. The number of available parallel sentences is therefore very important when constructing NMT systems. However, these bilingual resources are scarce for many low-resource language pairs. Although several works attempt to obtain bilingual parallel data from the Internet, the quality and quantity of the mined bilingual corpora are limited for low-resource language pairs. To address this problem, we propose a multi-view knowledge distillation model (MvKD) that transfers knowledge from high-resource language pairs to low-resource languages by leveraging the language-invariant properties shared across languages. In particular, we treat the mining of bilingual parallel sentence pairs as a classification task and use a multi-view classifier to detect parallel sentence pairs. The multi-view classifier uses two views to recognize the semantic difference between two sentences: (i) word-level representations and (ii) sentence-level representations. We encode sentence-level representations to capture the semantic similarity of the two sentences. Moreover, we encode word-level representations to capture word translations within a pair of parallel sentences, which avoids accepting sentence pairs that are semantically similar but not parallel. Experimental results demonstrate that our proposed method can mine a significant amount of bilingual data and improve the quality of the mined parallel sentences. In particular, we carry out experiments in several real-world low-resource settings and achieve excellent results.
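The two-view idea in the abstract can be illustrated with a small sketch. This is not the authors' MvKD model, which learns its representations and relies on knowledge distillation. It is a toy stand-in that assumes precomputed sentence vectors, a hand-written bilingual lexicon, and fixed thresholds. All names (`sentence_view`, `word_view`, `is_parallel`, the threshold values) are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_view(src_vec, tgt_vec):
    """Sentence-level view: overall semantic similarity of the two sentences."""
    return cosine(src_vec, tgt_vec)


def word_view(src_tokens, tgt_tokens, lexicon):
    """Word-level view: fraction of source tokens that have at least one
    lexicon translation present in the target sentence."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    covered = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt)
    return covered / len(src_tokens)


def is_parallel(src_vec, tgt_vec, src_tokens, tgt_tokens, lexicon,
                sent_threshold=0.8, word_threshold=0.8):
    """Accept a pair only if both views agree (thresholds are assumed values)."""
    return (sentence_view(src_vec, tgt_vec) >= sent_threshold
            and word_view(src_tokens, tgt_tokens, lexicon) >= word_threshold)


# Toy English-French lexicon and sentence vectors (made up for this example).
lexicon = {"the": {"le", "la"}, "cat": {"chat"},
           "sleeps": {"dort"}, "dog": {"chien"}}
tgt_tokens = ["le", "chat", "dort"]
src_vec, tgt_vec = [1.0, 0.0], [0.9, 0.1]  # close in embedding space

parallel = is_parallel(src_vec, tgt_vec, ["the", "cat", "sleeps"], tgt_tokens, lexicon)
similar_only = is_parallel(src_vec, tgt_vec, ["the", "dog", "sleeps"], tgt_tokens, lexicon)
```

In this toy setup both source sentences are close to the target in the sentence-level view, but only the first passes the word-level check, which is the failure case the abstract says the word-level view is meant to catch.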

Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation

Improving Machine Translation with Phrase Pair Injection and Corpus Filtering

Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures

Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin

Boosting Unsupervised Machine Translation with Pseudo-Parallel Data

Low-Resource Machine Translation Training Curriculum Fit for Low-Resource Languages

A Hybrid Approach for Improved Low Resource Neural Machine Translation using Monolingual Data

Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

Mining parallel sentences from internet with multi-view knowledge distillation for low-resource language pairs

Exploiting Monolingual Data at Scale for Neural Machine Translation

A Survey on Low-Resource Neural Machine Translation

Handling Syntactic Divergence in Low-resource Machine Translation

Finding the Right Recipe for Low Resource Domain Adaptation in Neural Machine Translation

Leveraging Auxiliary Domain Parallel Data in Intermediate Task Fine-tuning for Low-resource Translation

A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations

A Morphologically-Aware Dictionary-based Data Augmentation Technique for Machine Translation of Under-Represented Languages

BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation

A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Towards Neural Machine Translation with Partially Aligned Corpora

An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

Corpus Augmentation by Sentence Segmentation for Low-Resource Neural Machine Translation