Abstract: Domain-specific Multi-modal Neural Machine Translation (DMNMT) aims to translate domain-specific sentences from a source language to a target language by incorporating text-related visual information. Domain-specific text and image data often complement each other and can jointly enhance the representation of domain-specific information. However, there is a considerable modality gap between image and text in both data format and semantic expression, which poses distinctive challenges for domain-specific translation. Narrowing the modality gap and improving domain-aware representation are therefore two critical challenges in DMNMT. To this end, this paper proposes a progressive modality-complement aggregative MultiTransformer, which aims to simultaneously narrow the modality gap and capture domain-specific multi-modal representations. We first adopt a bidirectional progressive cross-modal interactive strategy that narrows the text-to-text, text-to-visual, and visual-to-text semantic gaps in the multi-modal representation space by integrating visual and text information layer by layer. Subsequently, we introduce a modality-complement MultiTransformer based on progressive cross-modal interaction to extract domain-related multi-modal representations, thereby enhancing machine translation performance. Experiments on the Fashion-MMT and Multi-30k datasets show that the proposed approach outperforms the compared state-of-the-art (SOTA) methods on the En-Zh task in the E-commerce domain and on the En-De, En-Fr, and En-Cs tasks of Multi-30k in the general domain. In-depth analysis confirms the validity of the proposed modality-complement MultiTransformer and the bidirectional progressive cross-modal interactive strategy for DMNMT.
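To make the idea of layer-by-layer bidirectional interaction more concrete, the following is a minimal sketch, not the architecture proposed in the paper. It assumes that each layer lets text tokens attend to image regions and image regions attend to text tokens, and that both updates are added back as residuals. The function names (`cross_attend`, `progressive_interaction`), the toy dimensions, and the omission of learned projections, multiple heads, and layer normalization are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Each query vector is replaced by a weighted average of the vectors in
    keys_values, weighted by scaled dot-product similarity.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def residual(x, update):
    """Element-wise residual connection: x + update."""
    return [[a + b for a, b in zip(r, u)] for r, u in zip(x, update)]


def progressive_interaction(text, image, num_layers=3):
    """Hypothetical bidirectional interaction, repeated layer by layer.

    In every layer both directions are computed from the same inputs
    (text-to-visual and visual-to-text), then applied as residual updates.
    """
    for _ in range(num_layers):
        text_update = cross_attend(text, image)   # text queries, visual keys
        image_update = cross_attend(image, text)  # visual queries, text keys
        text = residual(text, text_update)
        image = residual(image, image_update)
    return text, image


# Toy inputs: 5 text tokens and 3 image regions, each a 4-dimensional vector.
random.seed(0)
text = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
image = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]

fused_text, fused_image = progressive_interaction(text, image)
print(len(fused_text), len(fused_text[0]), len(fused_image), len(fused_image[0]))
```

The sketch only shows the data flow: sequence lengths and feature sizes are preserved for both modalities, while each representation progressively absorbs information from the other. A trained model would add learned projections and a translation decoder on top of the fused representations.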

Progressive modality-complement aggregative multitransformer for domain multi-modal neural machine translation
