Abstract: Recently, prompt learning has garnered considerable attention for its success in various Vision-Language (VL) tasks. However, existing prompt-based models are primarily focused on studying prompt generation and prompt strategies with complete modality settings, which does not accurately reflect real-world scenarios where partial modality information may be missing. In this paper, we present the first comprehensive investigation into prompt learning behavior when modalities are incomplete, revealing the high sensitivity of prompt-based models to missing modalities. To this end, we propose a novel Multi-step Adaptive Prompt Learning (MuAP) framework, aiming to generate multimodal prompts and perform multi-step prompt tuning, which adaptively learns knowledge by iteratively aligning modalities. Specifically, we generate multimodal prompts for each modality and devise prompt strategies to integrate them into the Transformer model. Subsequently, we sequentially perform prompt tuning from single-stage and alignment-stage, allowing each modality-prompt to be autonomously and adaptively learned, thereby mitigating the imbalance issue caused by only textual prompts that are learnable in previous works. Extensive experiments demonstrate the effectiveness of our MuAP and this model achieves significant improvements compared to the state-of-the-art on all benchmark datasets.

What problem does this paper attempt to address?

The paper addresses the poor performance of existing prompt-based learning methods on vision-language (VL) tasks when modal data is incomplete. Most existing research assumes that all modalities are available during both training and testing, which is often difficult to achieve in real-world applications. For example, text data may be inaccessible because of privacy and security concerns, or visual data may be missing because of device limitations. Such missing modalities significantly degrade the performance of vision-language models.

To solve this problem, the authors propose the Multi-step Adaptive Prompt Learning (MuAP) framework, which generates multimodal prompts and performs multi-step prompt tuning, adaptively learning knowledge by iteratively aligning modalities. MuAP generates a prompt for each modality, designs prompt strategies to integrate the prompts into the Transformer model, and then performs prompt tuning sequentially in a single stage and an alignment stage. This lets each modality prompt be learned independently and adaptively, alleviating the imbalance that arises in previous work where only text prompts are learnable.

The main contributions of the paper are:

1. **First analysis of the robustness of prompt learning under missing modal data.** The paper proposes a new multi-step adaptive prompt learning method for handling missing modalities in VL models, addressing the limitations of existing work and strengthening the prompts through both autonomous and collaborative learning.
2. **A multi-step tuning strategy.** It covers single-stage and alignment-stage tuning and adaptively generates visual and language prompts through multi-step modal alignment, so that knowledge is learned from both modalities without bias.
3. **Extensive experiments and ablation studies.** These verify the effectiveness of MuAP on three benchmark datasets, and the results show that the model outperforms the current state-of-the-art methods on all of them.
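
The idea of attaching a prompt to each modality can be illustrated with a small sketch. This is not the MuAP implementation: the function name `build_input`, the zero-vector placeholder for a missing modality, and the simple "prompt, then tokens" ordering are all assumptions made for illustration, since the summary does not specify how prompts are integrated into the Transformer.

```python
from typing import List, Optional

Vector = List[float]


def build_input(
    text_tokens: Optional[List[Vector]],
    image_tokens: Optional[List[Vector]],
    text_prompt: List[Vector],
    image_prompt: List[Vector],
    dim: int,
    placeholder_len: int = 1,
) -> List[Vector]:
    """Build one Transformer input sequence from two modalities.

    Each modality's prompt vectors are prepended to that modality's tokens.
    A missing modality (None) is replaced by zero placeholder tokens; this
    placeholder choice is an assumption, not something the paper states.
    """

    def placeholder() -> List[Vector]:
        return [[0.0] * dim for _ in range(placeholder_len)]

    text = text_tokens if text_tokens is not None else placeholder()
    image = image_tokens if image_tokens is not None else placeholder()
    # Hypothetical layout: [text prompt | text tokens | image prompt | image tokens]
    return text_prompt + text + image_prompt + image


# Example: the text modality is missing, the image modality is present.
sequence = build_input(
    text_tokens=None,
    image_tokens=[[1.0, 2.0]],
    text_prompt=[[0.1, 0.1]],
    image_prompt=[[0.2, 0.2]],
    dim=2,
)
```

In a setup like this, both `text_prompt` and `image_prompt` would be trainable parameters, which is the property the summary contrasts with earlier work where only text prompts are learnable.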

MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality

APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

Multimodal Prompting with Missing Modalities for Visual Recognition

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Progressive Multi-modal Conditional Prompt Tuning

Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition

Deep Correlated Prompting for Visual Recognition with Missing Modalities

Multi-modal Attribute Prompting for Vision-Language Models

Towards Robust Multimodal Prompting With Missing Modalities

MaPLe: Multi-modal Prompt Learning

ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models

Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models

Adaptive Multi-Modality Prompt Learning

MCPL: Multi-modal Collaborative Prompt Learning for Medical Vision-Language Model

Multi-Source Augmentation and Composite Prompts for Visual Recognition with Missing Modality

Modality-invariant and Specific Prompting for Multimodal Human Perception Understanding

ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts

Multi-Prompt with Depth Partitioned Cross-Modal Learning

Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering

Mutual Prompt Leaning for Vision Language Models