Parallel Corpus Augmentation using Masked Language Models

Vibhuti Kumari, Narayana Murthy Kavi
2024-10-04
Abstract: In this paper we propose a novel method for augmenting parallel text corpora that promises good quality and can produce corpora many times larger than the seed corpus we start with. We do not need any additional monolingual corpora. We use a multilingual masked language model to mask and predict alternative words in context, and we use sentence embeddings to check and select sentence pairs that are likely to be translations of each other. We cross-check our method using metrics for MT quality estimation. We believe this method can greatly alleviate the data scarcity problem for all language pairs for which a reasonable seed corpus is available.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the scarcity of parallel corpora, especially for training neural machine translation (NMT) systems. Existing parallel corpora are usually too small to train high-quality NMT systems, because manual translation and the proofreading of machine-translation output are both time-consuming and costly. Automatically constructed parallel corpora can increase the amount of data, but their quality is often problematic.

To alleviate this, the paper proposes a method for augmenting parallel text corpora that uses a multilingual masked language model (such as XLM-RoBERTa) together with sentence embeddings to generate new parallel sentence pairs. The resulting corpus can be many times larger than the seed corpus while aiming to retain good quality.

The main contributions of the paper are as follows:

1. **No dependence on additional monolingual corpora**: Unlike many other parallel-corpus augmentation methods, this method does not require additional monolingual corpora.
2. **High-quality parallel-corpus generation**: The method masks words and predicts alternatives in context, then uses sentence embeddings to check whether each generated sentence pair is likely to be a translation pair, so that only plausible pairs are retained.
3. **Applicability to multiple language pairs**: The method applies to any language pair for which a reasonably sized seed corpus already exists, which makes it especially relevant for language pairs with scarce data.

In conclusion, the goal of the paper is to provide an effective, quality-preserving method for augmenting parallel corpora to support the training and development of neural machine translation systems.
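The mask-predict-filter idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the choice to substitute on both sides independently, the `top_k` limit, and the similarity threshold of 0.8 are all assumptions made for illustration. The `predict` and `embed` callables are placeholders; in practice `predict` might wrap a multilingual masked language model such as XLM-RoBERTa and `embed` a multilingual sentence encoder, whereas here they are toy lookup tables so that the example runs with the standard library alone.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"  # mask token string; XLM-RoBERTa uses this form

Predictor = Callable[[str], List[str]]   # masked sentence -> candidate words
Embedder = Callable[[str], Sequence[float]]  # sentence -> embedding vector


def masked_variants(sentence: str, predict: Predictor, top_k: int = 3) -> List[str]:
    """Mask each word in turn and substitute up to top_k predicted alternatives."""
    tokens = sentence.split()  # assumption: whitespace tokenization
    variants = []
    for i, original in enumerate(tokens):
        masked = " ".join(tokens[:i] + [MASK] + tokens[i + 1:])
        for candidate in predict(masked)[:top_k]:
            if candidate != original:  # keep only genuine alternatives
                variants.append(" ".join(tokens[:i] + [candidate] + tokens[i + 1:]))
    return variants


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def augment_pair(src: str, tgt: str, predict: Predictor, embed: Embedder,
                 threshold: float = 0.8, top_k: int = 3) -> List[Tuple[str, str]]:
    """Generate variants of both sides and keep pairs whose embeddings agree."""
    src_candidates = [src] + masked_variants(src, predict, top_k)
    tgt_candidates = [tgt] + masked_variants(tgt, predict, top_k)
    kept = []
    for s in src_candidates:
        for t in tgt_candidates:
            if (s, t) == (src, tgt):
                continue  # skip the seed pair itself
            if cosine(embed(s), embed(t)) >= threshold:
                kept.append((s, t))
    return kept


# --- Toy stand-ins (hypothetical; a real setup would call actual models) ---
TOY_PREDICTIONS = {
    "the <mask> sleeps": ["cat", "dog"],
    "le <mask> dort": ["chat", "chien"],
}
TOY_CONCEPTS = {"the": 0, "le": 0, "cat": 1, "chat": 1,
                "dog": 2, "chien": 2, "sleeps": 3, "dort": 3}


def toy_predict(masked_sentence: str) -> List[str]:
    return TOY_PREDICTIONS.get(masked_sentence, [])


def toy_embed(sentence: str) -> List[float]:
    # Language-independent bag-of-concepts vector standing in for a sentence embedding.
    vec = [0.0] * 4
    for word in sentence.split():
        vec[TOY_CONCEPTS[word]] += 1.0
    return vec


result = augment_pair("the cat sleeps", "le chat dort", toy_predict, toy_embed)
print(result)  # [('the dog sleeps', 'le chien dort')]
```

In this toy run the mismatched combinations ("the dog sleeps" with "le chat dort", and "the cat sleeps" with "le chien dort") score about 0.67 and fall below the threshold, so only the consistent pair survives. This is the role the summary attributes to the sentence-embedding check: the masked language model proposes alternatives freely, and the embedding filter discards pairs that no longer look like translations of each other.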