Abstract: In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) when missing-modality occurs either during training or testing in real-world situations; and 2) when the computation resources are not available to finetune on heavy transformer models. To this end, we propose to utilize prompt learning and mitigate the above two challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while only requiring less than 1% learnable parameters compared to training the entire model. We further explore the effect of different prompt configurations and analyze the robustness to missing modality. Extensive experiments are conducted to show the effectiveness of our prompt learning framework that improves the performance under various missing-modality cases, while alleviating the requirement of heavy model re-training. Code is available.

What problem does this paper attempt to address?

This paper attempts to address two major challenges encountered in multi-modal visual recognition:

1. **Missing-modality problem**: In practical applications, during both the training and testing phases, there may be cases where one or some modalities of data are missing. For example, when dealing with a dataset containing images and texts, some samples may lack text descriptions or image content.
2. **Computational resource limitations**: Since multi-modal Transformer models are usually very large (possibly containing billions of parameters), fine-tuning them requires a large amount of computational resources, which is not feasible in many practical application scenarios.

To solve the above problems, the authors propose a method based on prompt learning. By introducing "modality-missing-aware prompts", they can deal with the missing-modality problem in multi-modal data, and only need to fine-tune less than 1% of the model parameters, thus avoiding the expensive fine-tuning of the entire model.

### Main contributions

- **Proposing a general scenario**: In this scenario, the modality-missing situation of each data sample can be different, and it can occur either during the training phase or the testing phase.
- **Designing modality-missing-aware prompts**: These prompts can be dynamically adjusted according to different missing situations of the input data, helping the pre-trained model better handle the problem of missing modalities without the need to fine-tune the entire model.
- **Studying two prompt integration methods**: Namely, adding prompts at the input layer and the attention layer. Experimental results show that input-layer prompts usually perform better, but on some datasets, attention-layer prompts may be more stable.

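
The idea of selecting a prompt according to which modality is missing, and attaching it at the input layer, can be illustrated with a minimal sketch. This is not the authors' implementation: the case names, the helper functions (`make_prompt`, `detect_case`, `attach_input_prompt`), the use of dummy tokens for a missing modality, and the token ordering are all assumptions made for illustration. Plain Python lists stand in for embedding tensors, and the frozen transformer is left out entirely.

```python
import random

# Hypothetical case names for an image+text model; not taken from the paper.
CASES = ("complete", "missing_text", "missing_image")


def make_prompt(length, dim, rng):
    """Create `length` prompt vectors of size `dim` with a small random init."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(length)]


def detect_case(image_tokens, text_tokens):
    """Map which inputs are present (None means missing) to a case name."""
    if image_tokens is None and text_tokens is None:
        raise ValueError("at least one modality must be present")
    if text_tokens is None:
        return "missing_text"
    if image_tokens is None:
        return "missing_image"
    return "complete"


def attach_input_prompt(prompts, image_tokens, text_tokens, dummy_image, dummy_text):
    """Prepend the prompt for the detected case to the input token sequence.

    A missing modality is replaced by dummy tokens (an assumption of this
    sketch) so the sequence layout stays the same for every sample.
    """
    case = detect_case(image_tokens, text_tokens)
    image = dummy_image if image_tokens is None else image_tokens
    text = dummy_text if text_tokens is None else text_tokens
    # Only the vectors in prompts[case] would be trained; the backbone stays frozen.
    return case, prompts[case] + text + image


rng = random.Random(0)
dim = 4
prompts = {case: make_prompt(2, dim, rng) for case in CASES}
image_tokens = [[1.0] * dim for _ in range(3)]
case, sequence = attach_input_prompt(
    prompts, image_tokens, None,
    dummy_image=[[0.0] * dim], dummy_text=[[0.0] * dim],
)
print(case, len(sequence))  # missing_text 6
```

In this sketch a sample with no text picks the `missing_text` prompt, so the sequence holds 2 prompt vectors, 1 dummy text token, and 3 image tokens. The attention-layer variant mentioned above would instead attach prompts inside the attention computation and is not shown here.
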
### Experimental results

The authors conducted experiments on three multi-modal downstream tasks, namely movie genre classification (MM-IMDb), food classification (UPMC Food-101), and hate-speech detection (Hateful Memes). The experimental results show that, compared with the baseline model, the method using modality-missing-aware prompts can significantly improve performance in various missing-modality situations, especially when computational resources are limited.

### Conclusion

This paper effectively solves the problems of missing modalities and computational resource limitations in multi-modal visual recognition by introducing modality-missing-aware prompts, providing a new solution for the popularization of multi-modal learning in practical applications.

Multimodal Prompting with Missing Modalities for Visual Recognition

Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition

Deep Correlated Prompting for Visual Recognition with Missing Modalities

Towards Robust Multimodal Prompting With Missing Modalities

MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality

Multi-Source Augmentation and Composite Prompts for Visual Recognition with Missing Modality

Visual Prompt Flexible-Modal Face Anti-Spoofing

Modality-invariant and Specific Prompting for Multimodal Human Perception Understanding

ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models

Reconstruct before Query: Continual Missing Modality Learning with Decomposed Prompt Collaboration

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Multi-Prompt with Depth Partitioned Cross-Modal Learning

Instruction-ViT: Multi-modal prompts for instruction learning in vision transformer

Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

Adaptive Multi-Modality Prompt Learning

Tuning Multi-mode Token-level Prompt Alignment across Modalities

Mutual Prompt Leaning for Vision Language Models

Visual Prompt Multi-Modal Tracking

Conditional Prompt Tuning for Multimodal Fusion

Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models