Abstract: As an extension of machine translation, multi-modal machine translation aims primarily to make the best use of visual information. Technically, image information is integrated into multi-modal fusion and alignment as an auxiliary modality through concepts or latent semantics, typically within a Transformer-based framework. However, current approaches often neglect one modality while designing numerous handcrafted features (e.g., visual concept extraction) and require training all parameters of the framework. It is therefore worthwhile to explore both multi-modal concepts or features that enhance performance and an efficient way to incorporate visual information at minimal cost. Meanwhile, multi-modal large language models (MLLMs), despite their powerful capabilities, suffer from visual hallucination, which compromises performance. Inspired by pioneering techniques in the multi-modal field, such as prompt learning and MLLMs, this paper explores applying multi-modal prompt learning to multi-modal machine translation. Our framework offers three key advantages: it establishes a robust connection between visual concepts and the translation process, requires training as few as 1.46M parameters, and can be seamlessly integrated into any existing framework by retrieving from a multi-modal dictionary. Specifically, we propose two prompt-guided strategies: a learnable prompt-refined module and a heuristic prompt-refined module. The learnable strategy utilizes off-the-shelf pre-trained models, while the heuristic strategy constrains the hallucination problem via concept retrieval. Experiments on two real-world benchmark datasets demonstrate that the proposed method outperforms all competing methods.

RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation

m3P: Towards Multimodal Multilingual Translation with Multimodal Prompt

LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation

Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning

Multimodal Prompting with Missing Modalities for Visual Recognition

MPT4LM: Multi-Modal Prompt Tuning Makes Pre-Trained Large Language Models Better Vision-Language Learners

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Progressive Multi-modal Conditional Prompt Tuning

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality

ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models

Contrastive Learning Based Visual Representation Enhancement for Multimodal Machine Translation

MedPrompt: Cross-Modal Prompting for Multi-Task Medical Image Translation

MePT: Multi-Representation Guided Prompt Tuning for Vision-Language Model

Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment

A Retrospect to Multi-prompt Learning Across Vision and Language

Tuning Multi-mode Token-level Prompt Alignment across Modalities

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

MmAP: Multi-modal Alignment Prompt for Cross-domain Multi-task Learning

Supervised Visual Attention for Simultaneous Multimodal Machine Translation

Cross-Lingual Transfer for Natural Language Inference via Multilingual Prompt Translator