Abstract: In this paper we propose a novel method of augmenting parallel text corpora which promises good quality and is also capable of producing corpora many times larger than the seed corpus we start with. We do not need any additional monolingual corpora. We use a Multilingual Masked Language Model to mask and predict alternative words in context, and we use Sentence Embeddings to check and select sentence pairs which are likely to be translations of each other. We cross-check our method using metrics for MT Quality Estimation. We believe this method can greatly alleviate the data scarcity problem for all language pairs for which a reasonable seed corpus is available.

What problem does this paper attempt to address?

This paper addresses the scarcity of parallel corpora, a particular bottleneck for neural machine translation (NMT) systems. It proposes a method for augmenting parallel text corpora that can generate new sentence pairs of good quality in quantities many times larger than the initial seed corpus. The paper notes that existing parallel corpora are usually too small to train high-quality NMT systems, because manual translation and proofreading of machine translation output are both time-consuming and costly. Automatically constructed parallel corpora increase the amount of data, but their quality is often problematic. The proposed method aims to alleviate this problem by using a multilingual masked language model (such as XLM-RoBERTa) together with sentence embeddings to generate new parallel sentence pairs.

The main contributions of the paper are as follows:

1. **No dependence on additional monolingual corpora**: Unlike many other parallel corpus augmentation methods, this method does not require additional monolingual corpora.
2. **High-quality parallel corpus generation**: Words are masked and alternative words are predicted in context, and sentence embeddings are then used to check translational equivalence, so that only sentence pairs likely to be translations of each other are selected.
3. **Applicability to multiple language pairs**: The method applies to any language pair for which a reasonably sized seed corpus already exists, which makes it especially relevant for language pairs with scarce data.

In conclusion, the goal of the paper is to provide an effective, high-quality method for augmenting parallel corpora to support the training and development of neural machine translation systems.
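The mask-predict-filter idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`variants`, `cosine`, `augment`), the one-word-at-a-time substitution strategy, the `top_k` value, and the 0.8 similarity threshold are all assumptions chosen for illustration. The masked language model and the sentence encoder are passed in as plain callables (`fill_mask` and `embed`, both hypothetical), so that any multilingual fill-mask model such as XLM-RoBERTa and any multilingual sentence-embedding model could be plugged in.

```python
import math
from itertools import product
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"  # mask token used by XLM-RoBERTa; other models may differ

# Hypothetical interfaces (assumptions, not from the paper):
#   FillMask: masked sentence -> candidate words for the masked slot
#   Embed:    sentence -> language-independent sentence vector
FillMask = Callable[[str], List[str]]
Embed = Callable[[str], Sequence[float]]


def variants(sentence: str, fill_mask: FillMask, top_k: int = 3) -> List[str]:
    """Mask each word in turn and substitute the predicted alternatives."""
    words = sentence.split()
    out = []
    for i, original in enumerate(words):
        masked = " ".join(words[:i] + [MASK] + words[i + 1:])
        for candidate in fill_mask(masked)[:top_k]:
            if candidate != original:  # skip the unchanged sentence
                out.append(" ".join(words[:i] + [candidate] + words[i + 1:]))
    return out


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def augment(
    seed_pairs: List[Tuple[str, str]],
    fill_mask: FillMask,
    embed: Embed,
    threshold: float = 0.8,  # assumed cut-off, to be tuned per language pair
    top_k: int = 3,
) -> List[Tuple[str, str]]:
    """Generate source/target variants and keep pairs with similar embeddings."""
    kept = []
    for src, tgt in seed_pairs:
        src_variants = variants(src, fill_mask, top_k)
        tgt_variants = variants(tgt, fill_mask, top_k)
        for s, t in product(src_variants, tgt_variants):
            if cosine(embed(s), embed(t)) >= threshold:
                kept.append((s, t))
    return kept
```

Injecting the models as callables keeps the sketch independent of any particular library. In practice `fill_mask` would wrap a multilingual masked language model and `embed` a multilingual sentence encoder, and the threshold would need to be validated, for example against the MT quality estimation metrics the abstract mentions.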

Parallel Corpus Augmentation using Masked Language Models

A Morphologically-Aware Dictionary-based Data Augmentation Technique for Machine Translation of Under-Represented Languages

A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Corpus Augmentation by Sentence Segmentation for Low-Resource Neural Machine Translation

Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods

Improving Machine Translation with Phrase Pair Injection and Corpus Filtering

Improving Data Augmentation for Low-Resource NMT Guided by POS-Tagging and Paraphrase Embedding

Iterative Mask Filling: An Effective Text Augmentation Method Using Masked Language Modeling

Data Augmentation for Low‐resource Languages NMT Guided by Constrained Sampling

Augmenting text for spoken language understanding with Large Language Models

Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Building a Parallel Corpus and Training Translation Models Between Luganda and English

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

ExtraPhrase: Efficient Data Augmentation for Abstractive Summarization

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

Generating Multilingual Parallel Corpus Using Subtitles

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

Sentence alignment using hybrid model

Machine Translation Model based on Non-parallel Corpus and Semi-supervised Transductive Learning

Automatic Parallel Corpus Creation for Hindi-English News Translation Task

Building a Large English-Chinese Parallel Corpus from Comparable Patents and Its Experimental Application to SMT