Abstract: Transformer neural models with multihead attention outperform all existing translation models. Nevertheless, some features of traditional statistical models, such as prior alignment between source and target words, prove useful in training state-of-the-art Transformer models. It has been reported that lightweight prior alignment can effectively guide a head in the multihead cross-attention sublayer responsible for alignment in Transformer models. In this work, we go a step further by applying heavyweight prior alignments to guide all heads. Specifically, we use a weight of 0.5 for the alignment cost, which is added to the token cost to form the overall cost of training a Transformer model; the alignment cost is defined as the deviation of the attention probability from the prior alignments. Moreover, we increase the role of prior alignment by computing the attention probability as the average over all heads of the multihead attention sublayer within the penultimate layer of the Transformer model. Experimental results on an English-Vietnamese translation task show that the proposed approach helps train superior Transformer-based translation models. Our Transformer model (25.71 BLEU) outperforms the baseline model (21.34 BLEU) by a large margin of 4.37 BLEU. Case studies by native speakers on some translation results validate the machine judgment. The results so far encourage the use of heavyweight prior alignments to improve Transformer-based translation models. This work contributes to the literature on machine translation, especially for less commonly studied language pairs. Since the proposal is language-independent, it can be applied to other language pairs, including those involving Slavic languages.
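The cost described in the abstract can be illustrated with a minimal sketch. The abstract states only that the alignment cost is a "deviation" of the head-averaged attention probability from the prior alignments and that it is weighted by 0.5; the cross-entropy form used below, the per-target-position normalization, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math

ALIGNMENT_WEIGHT = 0.5  # weight of the alignment cost, as stated in the abstract


def average_heads(head_attentions):
    """Average attention probabilities over all heads.

    head_attentions[h][t][s] is the probability that head h assigns to
    source position s when producing target position t.
    """
    num_heads = len(head_attentions)
    num_tgt = len(head_attentions[0])
    num_src = len(head_attentions[0][0])
    return [
        [sum(head[t][s] for head in head_attentions) / num_heads
         for s in range(num_src)]
        for t in range(num_tgt)
    ]


def alignment_cost(avg_attention, prior_alignment, eps=1e-9):
    """Deviation of averaged attention from the prior alignment.

    ASSUMPTION: cross-entropy, averaged over target positions. The
    abstract does not say which deviation measure is used.
    """
    total = 0.0
    for att_row, prior_row in zip(avg_attention, prior_alignment):
        for prob, prior in zip(att_row, prior_row):
            total -= prior * math.log(prob + eps)
    return total / len(avg_attention)


def overall_cost(token_cost, head_attentions, prior_alignment,
                 weight=ALIGNMENT_WEIGHT):
    """Overall training cost: token cost plus weighted alignment cost."""
    avg_attention = average_heads(head_attentions)
    return token_cost + weight * alignment_cost(avg_attention, prior_alignment)


# Toy example: 2 heads, 2 target positions, 2 source positions.
heads = [
    [[0.8, 0.2], [0.4, 0.6]],
    [[0.6, 0.4], [0.2, 0.8]],
]
prior = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical prior alignment (diagonal)
cost = overall_cost(token_cost=2.0, head_attentions=heads,
                    prior_alignment=prior)
```

In this toy case the averaged attention places probability 0.7 on each aligned pair, so the assumed alignment cost is -ln(0.7) and the overall cost is the token cost plus half of that value. Averaging over heads before comparing with the prior means that every head receives a gradient from the alignment term, which is one way to read "guide all heads".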

Transformation from Discontinuous to Continuous Word Alignment Improves Translation Quality

Enhancing Statistical Machine Translation with Character Alignment

Combining Multiple Alignments to Improve Machine Translation

Word Alignment Combination over Multiple Word Segmentation

Consistency-Aware Search for Word Alignment

Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

Improving domain-specific word alignment for computer assisted translation

Improving statistical word alignment with a rule-based machine translation system

Improving Statistical Machine Translation with monolingual collocation

Weighted Alignment Matrices for Statistical Machine Translation

Improving Statistical Word Alignment with Various Clues

Discriminative Word Alignment over Multiple Word Segmentations

Comparative Study of Word Alignment Heuristics and Phrase-Based SMT

Optimizing Word Alignment Combination for Phrase Table Training

Improving Word Alignment by Semi-Supervised Ensemble

Towards Integrated Machine Translation Using Structural Alignment From Syntax-Augmented Synchronous Parsing

Heavyweight Statistical Alignment to Guide Neural Translation

Extracting Hierarchical Rules from a Weighted Alignment Matrix

Finding Better Subword Segmentation for Neural Machine Translation

Better Simultaneous Translation with Monotonic Knowledge Distillation

Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings