Abstract: Transformer neural models with multihead attention outperform all existing translation models. Nevertheless, some features of traditional statistical models, such as prior alignments between source and target words, prove useful in training state-of-the-art Transformer models. It has been reported that a lightweight prior alignment can effectively guide one head of the multihead cross-attention sublayer, namely the head responsible for alignment in Transformer models. In this work, we go a step further by applying heavyweight prior alignments to guide all heads. Specifically, we add the alignment cost, weighted by 0.5, to the token cost to form the overall cost of training a Transformer model, where the alignment cost is defined as the deviation of the attention probability from the prior alignments. Moreover, we increase the role of prior alignment by computing the attention probability as the average over all heads of the multihead attention sublayer in the penultimate layer of the Transformer model. Experimental results on an English-Vietnamese translation task show that the proposed approach helps train superior Transformer-based translation models. Our Transformer model (25.71 BLEU) outperforms the baseline model (21.34 BLEU) by a large margin of 4.37 BLEU. Case studies by native speakers on selected translation results validate the automatic evaluation. The results so far encourage the use of heavyweight prior alignments to improve Transformer-based translation models. This work contributes to the literature on machine translation, especially for less-studied language pairs. Since the proposed approach is language-independent, it can be applied to other language pairs, including those involving Slavic languages.
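The cost formulation described in the abstract can be illustrated with a minimal sketch. The abstract states only that the alignment cost is a "deviation" of the attention probability from the prior alignments; the cross-entropy used below is an assumption, as are all function names and the toy numbers. Only the weight of 0.5 and the averaging over all heads come from the abstract.

```python
import math

ALIGNMENT_WEIGHT = 0.5  # weight of the alignment cost, as stated in the abstract


def average_heads(head_attn):
    """Average cross-attention probabilities over all heads.

    head_attn[h][t][s] is the probability that target position t
    attends to source position s in head h (hypothetical layout).
    """
    n_heads = len(head_attn)
    n_tgt = len(head_attn[0])
    n_src = len(head_attn[0][0])
    return [
        [sum(head_attn[h][t][s] for h in range(n_heads)) / n_heads
         for s in range(n_src)]
        for t in range(n_tgt)
    ]


def alignment_cost(avg_attn, prior, eps=1e-9):
    """Deviation of averaged attention from the prior alignment.

    Assumption: cross-entropy, averaged over target positions.
    The abstract does not specify the deviation measure.
    """
    total = 0.0
    for attn_row, prior_row in zip(avg_attn, prior):
        for p_attn, p_prior in zip(attn_row, prior_row):
            if p_prior > 0.0:
                total -= p_prior * math.log(p_attn + eps)
    return total / len(prior)


def overall_cost(token_cost, head_attn, prior, weight=ALIGNMENT_WEIGHT):
    """Overall training cost: token cost plus weighted alignment cost."""
    return token_cost + weight * alignment_cost(average_heads(head_attn), prior)


# Toy example: 2 heads, 2 target positions, 2 source positions.
heads = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]
prior = [[1.0, 0.0], [0.0, 1.0]]  # prior alignment, e.g. from a statistical aligner
cost = overall_cost(2.0, heads, prior)
print(cost)
```

In this sketch the alignment term shrinks toward zero as the averaged attention approaches the prior alignment, so the overall cost then reduces to the token cost alone.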

Towards Better Word Alignment in Transformer

Accurate Word Alignment Induction from Neural Machine Translation

Iterative Task-adaptive Pretraining for Unsupervised Word Alignment

Jointly Learning to Align and Translate with Transformer Models

Improving Neural Sentence Alignment with Word Translation

On The Alignment Problem In Multi-Head Attention-Based Neural Machine Translation

Alignment-Enhanced Transformer for Constraining NMT with Pre-Specified Translations

Exploring the Relationship between Alignment and Cross-lingual Transfer in Multilingual Transformers

Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment

Combining Multiple Alignments to Improve Machine Translation

Word Alignment by Fine-tuning Embeddings on Parallel Corpora

Improving Word Alignment with Contextualized Embedding and Bilingual Dictionary

Heavyweight Statistical Alignment to Guide Neural Translation

Improving Word Alignment by Semi-Supervised Ensemble

Enhanced Pre-Trained Transformer with Aligned Attention Map for Text Matching

Improving domain-specific word alignment for computer assisted translation

A Closer Look at Transformer Attention for Multilingual Translation

End-to-End Neural Word Alignment Outperforms GIZA++

Multilingual BERT-based Word Alignment By Incorporating Common Chinese Characters

Word Alignment in the Era of Deep Learning: A Tutorial

When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation?