Layer-Level Progressive Transformer With Modality Difference Awareness for Multi-Modal Neural Machine Translation

Junjun Guo, Junjie Ye, Yan Xiang, Zhengtao Yu
DOI: https://doi.org/10.1109/taslp.2023.3301210
2023-01-01
Abstract: Multi-modal neural machine translation (MNMT) aims to translate sentences from a source language into a target language with the aid of corresponding images. Unfortunately, there is a considerable modality gap between semantically related images and texts in terms of data form and semantic expression. How to fully incorporate visual information into texts to enhance translation performance is one of the critical issues for MNMT. However, the initial visual and textual features are generally extracted with modality-specific models; consequently, there is a considerable representation gap between images and texts. Most previous MNMT works adopt only feature-level fusion strategies to learn multi-modal representations, and the modality representation gap is often ignored. To this end, this article proposes a progressive multi-modal Transformer (ProMul-Trans) with Modality Difference-Aware (MDA) to address the visual-to-textual fusion problem in MNMT. We first employ MDA to capture modality-consistency information by taking visual and textual representations as inputs in each Transformer layer. Then a layer-level progressive fusion (Layer-ProFusion) strategy is adopted to align visual and textual representations progressively, layer by layer, to enhance machine translation performance. Experiments on the Multi30K dataset show that the proposed approach outperforms the compared state-of-the-art (SOTA) methods on the English $\to$ German (En $\to$ De), English $\to$ French (En $\to$ Fr) and English $\to$ Czech (En $\to$ Cs) tasks. We release the code at https://github.com/JunjieYe-MMT/HierProMul-Trans.
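The layer-by-layer fusion idea described in the abstract can be illustrated with a toy sketch. This is not the authors' ProMul-Trans implementation: the abstract does not specify how MDA is computed, so the `agreement_gate` function below (a sigmoid of cosine similarity) is a hypothetical stand-in, and all function names are assumptions made for illustration. The sketch also omits learned projections, self-attention, and feed-forward sublayers. It only shows text states attending to visual features at every layer, with the fused result passed on to the next layer.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text, visual):
    """Each text vector attends over all visual vectors (scaled dot-product)."""
    d = len(text[0])
    out = []
    for h in text:
        weights = softmax([dot(h, v) / math.sqrt(d) for v in visual])
        out.append([sum(w * v[k] for w, v in zip(weights, visual)) for k in range(d)])
    return out


def agreement_gate(h, c):
    """Hypothetical stand-in for MDA: a gate in (0, 1) that grows with the
    cosine similarity between a text state and its visual context."""
    norm = math.sqrt(dot(h, h)) * math.sqrt(dot(c, c))
    cos = dot(h, c) / norm if norm > 0 else 0.0
    return 1.0 / (1.0 + math.exp(-cos))


def progressive_fusion(text, visual, num_layers=3):
    """Fuse visual context into the text states once per layer and return
    the states after every layer (index 0 is the unfused input)."""
    states = [text]
    for _ in range(num_layers):
        current = states[-1]
        context = cross_attend(current, visual)
        fused = []
        for h, c in zip(current, context):
            g = agreement_gate(h, c)
            # Residual update: keep the text state, add the gated visual context.
            fused.append([hk + g * ck for hk, ck in zip(h, c)])
        states.append(fused)
    return states


# Toy inputs: 2 text tokens and 3 image regions, all 2-dimensional.
text = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
states = progressive_fusion(text, visual, num_layers=3)
```

Each entry of `states` holds the text representations after one more fusion step, so `states[-1]` would be what a decoder consumes in this simplified setting. The released repository linked above is the reference for the actual model.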
engineering, electrical & electronic,acoustics