Abstract: Neural machine translation (NMT) has become prominent in many machine translation tasks. However, in some domain-specific tasks, only corpora from similar domains improve translation performance; if out-of-domain corpora are added directly to the in-domain corpus, translation performance may even degrade. Domain adaptation techniques are therefore essential for NMT. Most existing domain adaptation methods were designed for conventional phrase-based machine translation. For NMT domain adaptation, there have been only a few studies, on topics such as fine-tuning, domain tags, and domain features. In this paper, we pursue four goals for sentence-level NMT domain adaptation. First, we exploit NMT's internal sentence embedding and use sentence embedding similarity to select out-of-domain sentences that are close to the in-domain corpus. Second, we propose three sentence weighting methods, i.e., sentence weighting, domain weighting, and batch weighting, to balance the data distribution during NMT training. Third, we propose dynamic training methods that adjust the sentence selection and weighting during NMT training. Fourth, to address the multidomain problem in real-world NMT scenarios, where the domain distributions of training and testing data often mismatch, we propose a multidomain sentence weighting method that balances the domain distributions of the training data and matches them to those of the testing data. The proposed methods are evaluated on the International Workshop on Spoken Language Translation (IWSLT) English-to-French/German tasks and on a multidomain English-to-French task. Empirical results show that the sentence selection and weighting methods can significantly improve NMT performance, outperforming the existing baselines.
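The selection and weighting ideas in the abstract can be illustrated with a minimal sketch. It is one plausible reading, not the paper's implementation: the abstract does not give the similarity measure, the weighting formula, or the loss, so the cosine similarity to an in-domain centroid, the linear mapping from similarity to weight, and all function names (`cosine_to_centroid`, `select_and_weight`, `weighted_nll`) are assumptions made for illustration. The sentence embeddings are assumed to be precomputed vectors, for example taken from an NMT encoder.

```python
import numpy as np


def cosine_to_centroid(embeddings, centroid):
    """Cosine similarity between each row of `embeddings` and `centroid`."""
    emb_norm = np.linalg.norm(embeddings, axis=1)
    cen_norm = np.linalg.norm(centroid)
    # Small constant guards against division by zero for all-zero vectors.
    return embeddings @ centroid / (emb_norm * cen_norm + 1e-12)


def select_and_weight(in_emb, out_emb, top_k):
    """Pick the top_k out-of-domain sentences closest to the in-domain centroid.

    Returns the selected row indices (most similar first) and a weight in
    [0, 1] for each. The weight mapping is a hypothetical choice.
    """
    centroid = in_emb.mean(axis=0)
    sims = cosine_to_centroid(out_emb, centroid)
    selected = np.argsort(-sims)[:top_k]
    # Assumption: map cosine similarity from [-1, 1] linearly onto [0, 1].
    weights = (sims[selected] + 1.0) / 2.0
    return selected, weights


def weighted_nll(token_log_probs, weight):
    """Sentence-level negative log-likelihood scaled by a sentence weight."""
    return -weight * float(np.sum(token_log_probs))


# Toy 2-dimensional embeddings, invented purely for this example.
in_emb = np.array([[1.0, 0.0], [0.8, 0.2]])
out_emb = np.array([[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

selected, weights = select_and_weight(in_emb, out_emb, top_k=2)
loss = weighted_nll(np.array([-1.0, -2.0]), weight=0.5)
```

In this sketch, in-domain sentences would keep a weight of 1 while selected out-of-domain sentences contribute to the loss in proportion to their similarity. A dynamic variant, in the spirit of the third goal, could recompute `selected` and `weights` from updated embeddings at intervals during training.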

Effective Selection of Translation Model Training Data

Topic Model Based Adaptation Data Selection for Domain-Specific Machine Translation

A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

Transductive Data-Selection Algorithms for Fine-Tuning Neural Machine Translation

Reinforcing Language Model For Speech Translation With Auxiliary Data

Bilingual Recursive Neural Network Based Data Selection for Statistical Machine Translation

A Survey on Data Selection for Language Models

Data Selection Via Semi-supervised Recursive Autoencoders for SMT Domain Adaptation

A novel method to optimize training data for translation model adaptation

Dynamic Data Selection and Weighting for Iterative Back-Translation

Edit Distance: A New Data Selection Criterion for Domain Adaptation in SMT

Icpe: A Hybrid Data Selection Model for SMT Domain Adaptation

Data Selection with Feature Decay Algorithms Using an Approximated Target Side

A Context-Aware Topic Model for Statistical Machine Translation

Domain mining for machine translation

Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Data Selection Curriculum for Neural Machine Translation

Adaptive development data selection for log-linear model in statistical machine translation

Sentence Selection and Weighting for Neural Machine Translation Domain Adaptation

Research on Translation Selection Based on Target Language Statistics

Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information