Abstract: Domain-specific Multi-modal Neural Machine Translation (DMNMT) aims to translate domain-specific sentences from a source language into a target language by incorporating text-related visual information. Domain-specific text and image data often complement each other and can jointly enhance the representation of domain-specific information. However, there is a considerable modality gap between image and text in both data format and semantic expression, which poses distinctive challenges for domain-specific text translation. Narrowing the modality gap and improving domain-aware representation are therefore two critical challenges in DMNMT. To this end, this paper proposes a progressive modality-complement aggregative MultiTransformer, which aims to simultaneously narrow the modality gap and capture domain-specific multi-modal representations. We first adopt a bidirectional progressive cross-modal interactive strategy that narrows the text-to-text, text-to-visual, and visual-to-text semantic gaps in the multi-modal representation space by integrating visual and text information layer by layer. Subsequently, we introduce a modality-complement MultiTransformer based on progressive cross-modal interaction to extract domain-related multi-modal representations, thereby enhancing machine translation performance. Experiments on the Fashion-MMT and Multi-30k datasets show that the proposed approach outperforms the compared state-of-the-art (SOTA) methods on the En-Zh task in the e-commerce domain and on the En-De, En-Fr, and En-Cs tasks of Multi-30k in the general domain. In-depth analysis confirms the validity of the proposed modality-complement MultiTransformer and the bidirectional progressive cross-modal interactive strategy for DMNMT.
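
The layer-by-layer bidirectional interaction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain single-head scaled dot-product cross-attention with a residual connection, and it omits learned projections, normalization, and the modality-complement aggregation step. The names `cross_attend` and `progressive_bidirectional_interaction`, the layer count, and the toy feature sizes are all hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention plus a residual.

    Assumption: no learned projections; keys and values are the same vectors.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every vector of the other modality.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Attention-weighted sum of the other modality's vectors.
        context = [sum(w * kv[j] for w, kv in zip(weights, keys_values))
                   for j in range(d)]
        # Residual connection keeps the query's own information.
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


def progressive_bidirectional_interaction(text, image, num_layers=3):
    """Hypothetical sketch: at each layer, text attends to image and image
    attends to text, and both updated streams feed the next layer."""
    for _ in range(num_layers):
        new_text = cross_attend(text, image)    # visual-to-text direction
        new_image = cross_attend(image, text)   # text-to-visual direction
        text, image = new_text, new_image
    return text, image


# Toy inputs: 5 text tokens and 4 image regions, each an 8-dim feature.
random.seed(0)
text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
image = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]

fused_text, fused_image = progressive_bidirectional_interaction(text, image)
print(len(fused_text), len(fused_text[0]), len(fused_image), len(fused_image[0]))
```

Both streams keep their original sequence lengths and feature size, so in this sketch the fused text representation could be passed to a standard translation decoder in place of the text-only encoder output.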
