BERT-JAM: Maximizing the Utilization of BERT for Neural Machine Translation

Zhebin Zhang,Sai Wu,Dawei Jiang,Gang Chen
DOI: https://doi.org/10.1016/j.neucom.2021.07.002
IF: 6
2021-01-01
Neurocomputing
Abstract:Pre-training based approaches have been demonstrated effective for a wide range of natural language processing tasks. Leveraging BERT for neural machine translation (NMT), which we refer to as BERT enhanced NMT, has received increasing interest in recent years. However, there still exists a research gap in studying how to maximize the utilization of BERT for NMT tasks. Firstly, previous studies mostly focus on utilizing BERT's last-layer representation, neglecting the linguistic features encoded by the intermediate layers. Secondly, it requires further architectural exploration to integrate the BERT representation with the NMT encoder/decoder layers efficiently. And thirdly, existing methods keep the BERT parameters fixed during training to avoid the catastrophic forgetting problem, wasting the chances of boosting the performance via fine-tuning. In this paper, we propose BERT-JAM to fill the research gap from three aspects: 1) we equip BERT-JAM with fusion modules for composing BERT's multi-layer representations into a fused representation that can be leveraged by the NMT model, 2) BERT-JAM utilizes joint-attention modules to allow the BERT representation to be dynamically integrated with the encoder/decoder representations, and 3) we train BERT-JAM with a three-phase optimization strategy that progressively unfreezes different components to overcome catastrophic forgetting during fine-tuning. Experimental results show that BERT-JAM achieves state-of-the-art BLEU scores on multiple translation tasks. (c) 2021 Elsevier B.V. All rights reserved.
What problem does this paper attempt to address?