Abstract: Machine translation is rife with ambiguities in word ordering and word choice, and even with the advent of machine-learning methods that learn to resolve this ambiguity based on statistics from large corpora, mistakes are frequent. Multi-source translation is an approach that attempts to resolve these ambiguities by exploiting multiple inputs (e.g. sentences in three different languages) to increase translation accuracy. These methods are trained on multilingual corpora, which include the multiple source languages and the target language, and at test time they use information from both source languages while generating the target. While there are many such multilingual corpora, such as multilingual translations of TED talks or European Parliament proceedings, in practice many of them are incomplete because of the difficulty of providing translations in all of the relevant languages. Existing studies on multi-source translation did not explicitly handle such situations, and thus are only applicable to complete corpora that have all of the languages of interest, severely limiting their practical applicability. In this article, we examine approaches for multi-source neural machine translation (NMT) that can learn from and translate such incomplete corpora. Specifically, we propose methods to deal with incomplete corpora at both training time and test time. For training time, we examine two methods: (1) a simple method that replaces missing source translations with a special NULL symbol, and (2) a data augmentation approach that fills in incomplete parts with source translations created from multi-source NMT. For test time, we examine methods that use multi-source translation even when only a single source is provided by first translating into an additional auxiliary language using standard NMT, then using multi-source translation on the original source and this generated auxiliary-language sentence. Extensive experiments demonstrate that the proposed training-time and test-time methods both significantly improve translation performance.
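
As a rough illustration of two ideas named in the abstract, the NULL-symbol replacement at training time and the auxiliary-language step at test time, here is a minimal Python sketch. It is not the authors' implementation. The token string `<NULL>`, the dictionary layout of the corpus, and the function names `fill_missing_sources` and `translate_with_auxiliary` are all assumptions made for this example, and the two translation models are passed in as plain callables rather than real NMT systems.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical placeholder string; the abstract only says "a special NULL symbol".
NULL_TOKEN = "<NULL>"


def fill_missing_sources(
    example: Dict[str, Optional[str]], source_langs: List[str]
) -> Dict[str, str]:
    """Return one sentence per source language.

    A language that is absent, None, or empty in `example` is replaced
    by NULL_TOKEN so that every training example has all source slots.
    """
    return {lang: (example.get(lang) or NULL_TOKEN) for lang in source_langs}


def translate_with_auxiliary(
    source: str,
    to_auxiliary: Callable[[str], str],
    multi_source: Callable[[str, str], str],
) -> str:
    """Test-time sketch for a single available source sentence.

    `to_auxiliary` stands in for a standard single-source NMT model and
    `multi_source` for a two-source NMT model; both are assumed callables.
    """
    auxiliary = to_auxiliary(source)        # generate the auxiliary-language sentence
    return multi_source(source, auxiliary)  # translate from both inputs


# Toy incomplete corpus: the second example has no French source.
corpus = [
    {"es": "hola mundo", "fr": "bonjour le monde", "en": "hello world"},
    {"es": "gracias", "fr": None, "en": "thank you"},
]
filled = [fill_missing_sources(ex, ["es", "fr"]) for ex in corpus]
```

In this sketch the target side (`en`) is left untouched; only the source slots are padded, so a multi-source model can be fed a fixed number of inputs per example.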

Corpus Augmentation by Sentence Segmentation for Low-Resource Neural Machine Translation

A Scenario-Generic Neural Machine Translation Data Augmentation Method

Handling Syntactic Divergence in Low-resource Machine Translation

Towards Neural Machine Translation with Partially Aligned Corpora

Improving Data Augmentation for Low-Resource NMT Guided by POS-Tagging and Paraphrase Embedding

Neural System Combination For Machine Translation

Asynchronous and Segmented Bidirectional Encoding for NMT

Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation

Parallel Corpus Augmentation using Masked Language Models

Sentence Alignment with Parallel Documents Facilitates Biomedical Machine Translation

A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences

AdvAug: Robust Adversarial Augmentation for Neural Machine Translation

Translation of Patent Sentences with a Large Vocabulary of Technical Terms Using Neural Machine Translation

Data Augmentation for Low‐resource Languages NMT Guided by Constrained Sampling

Lexical-Constraint-Aware Neural Machine Translation via Data Augmentation

Multiple Segmentations of Thai Sentences for Neural Machine Translation

Semi-Supervised Neural Machine Translation Via Marginal Distribution Estimation

Multi-Source Neural Machine Translation With Missing Data

Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation

TAMS: Translation-Assisted Morphological Segmentation

Building a Parallel Corpus and Training Translation Models Between Luganda and English