Abstract: Domain-specific Multi-modal Neural Machine Translation (DMNMT) aims to translate domain-specific sentences from a source language to a target language by incorporating text-related visual information. Domain-specific text and image data often complement each other and can jointly enhance the representation of domain-specific information. However, there is a considerable modality gap between image and text in both data format and semantic expression, which poses distinctive challenges for domain-specific text translation tasks. Narrowing the modality gap and improving domain-aware representation are therefore two critical challenges in DMNMT. To this end, this paper proposes a progressive modality-complement aggregative MultiTransformer, which aims to simultaneously narrow the modality gap and capture domain-specific multi-modal representations. We first adopt a bidirectional progressive cross-modal interactive strategy that narrows the text-to-text, text-to-visual, and visual-to-text semantic gaps in the multi-modal representation space by integrating visual and text information layer by layer. We then introduce a modality-complement MultiTransformer, built on this progressive cross-modal interaction, to extract domain-related multi-modal representations, thereby enhancing machine translation performance. Experiments on the Fashion-MMT and Multi-30k datasets show that the proposed approach outperforms the compared state-of-the-art (SOTA) methods on the En-Zh task in the e-commerce domain and on the En-De, En-Fr, and En-Cs tasks of Multi-30k in the general domain. In-depth analysis confirms the validity of the proposed modality-complement MultiTransformer and the bidirectional progressive cross-modal interactive strategy for DMNMT.

Make the Blind Translator See The World: A Novel Transfer Learning Solution for Multimodal Machine Translation

Generalization algorithm of multimodal pre-training model based on graph-text self-supervised training

CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation

Multilingual Multimodal Learning with Machine Translated Text

Supervised Visual Attention for Simultaneous Multimodal Machine Translation

Layer-Level Progressive Transformer With Modality Difference Awareness for Multi-Modal Neural Machine Translation

Multimodal Pretraining from Monolingual to Multilingual

Multimodal Transformer For Multimodal Machine Translation

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Progressive modality-complement aggregative multitransformer for domain multi-modal neural machine translation

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

Enhancing Neural Machine Translation with Dual-Side Multimodal Awareness

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training

Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

Contrastive Learning Based Visual Representation Enhancement for Multimodal Machine Translation

Adding Multimodal Capabilities to a Text-only Translation Model

TCT: A Cross-supervised Learning Method for Multimodal Sequence Representation

Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts