Abstract: Recently, prompt learning has garnered considerable attention for its success in various Vision-Language (VL) tasks. However, existing prompt-based models primarily study prompt generation and prompt strategies under complete-modality settings, which do not accurately reflect real-world scenarios where partial modality information may be missing. In this paper, we present the first comprehensive investigation into prompt learning behavior when modalities are incomplete, revealing the high sensitivity of prompt-based models to missing modalities. To this end, we propose a novel Multi-step Adaptive Prompt Learning (MuAP) framework, which generates multimodal prompts and performs multi-step prompt tuning, adaptively learning knowledge by iteratively aligning modalities. Specifically, we generate multimodal prompts for each modality and devise prompt strategies to integrate them into the Transformer model. Subsequently, we perform prompt tuning sequentially in a single stage and an alignment stage, allowing each modality prompt to be learned autonomously and adaptively, thereby mitigating the imbalance caused by only textual prompts being learnable in previous works. Extensive experiments demonstrate the effectiveness of MuAP, which achieves significant improvements over the state of the art on all benchmark datasets.
What problem does this paper attempt to address?
This paper addresses the poor performance of existing prompt-based learning methods on vision-language (VL) tasks when modality data is incomplete. Most existing research assumes that all modalities are available during both training and testing, an assumption that often fails in real-world applications. For example, text data may be inaccessible because of privacy and security concerns, or visual data may be missing because of device limitations. Such missing modalities significantly degrade the performance of vision-language models.
To solve this problem, the authors propose Multi-step Adaptive Prompt Learning (MuAP), a framework that generates multimodal prompts and performs multi-step prompt tuning, adaptively learning knowledge by iteratively aligning modalities. MuAP generates multimodal prompts for each modality, designs prompt strategies to integrate them into the Transformer model, and then performs prompt tuning sequentially in a single stage and an alignment stage. This lets each modality prompt be learned autonomously and adaptively, alleviating the imbalance that arises in previous work, where only the textual prompts are learnable.
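The two tuning stages described above can be pictured with a small toy schedule. This is only a sketch under strong assumptions and not the paper's implementation: the prompts are plain lists of floats, the per-modality "task loss" is a hypothetical quadratic pull toward a fixed target, and the alignment term is a hypothetical squared distance between the two prompts. The names `multi_step_tuning`, `pull_grad`, and `align_weight` are invented for illustration.

```python
"""Toy two-phase prompt-tuning schedule (illustrative only, not the paper's code)."""


def sgd_step(prompt, grad, lr):
    """Return a new prompt after one plain gradient-descent step."""
    return [p - lr * g for p, g in zip(prompt, grad)]


def pull_grad(prompt, anchor):
    """Gradient of the toy loss 0.5 * ||prompt - anchor||^2 with respect to prompt."""
    return [p - a for p, a in zip(prompt, anchor)]


def multi_step_tuning(text_prompt, image_prompt, text_target, image_target,
                      single_steps=50, align_steps=50, lr=0.1, align_weight=0.5):
    # Single stage (assumed form): each loop updates only one modality's
    # prompt on its own toy task loss; the other prompt is left untouched.
    for _ in range(single_steps):
        text_prompt = sgd_step(text_prompt, pull_grad(text_prompt, text_target), lr)
    for _ in range(single_steps):
        image_prompt = sgd_step(image_prompt, pull_grad(image_prompt, image_target), lr)

    # Alignment stage (assumed form): both prompts are updated together on
    # task loss + align_weight * alignment loss, which pulls them toward each other.
    for _ in range(align_steps):
        grad_text = [t + align_weight * a for t, a in zip(
            pull_grad(text_prompt, text_target),
            pull_grad(text_prompt, image_prompt))]
        grad_image = [t + align_weight * a for t, a in zip(
            pull_grad(image_prompt, image_target),
            pull_grad(image_prompt, text_prompt))]
        # Both gradients are computed before either prompt is changed.
        text_prompt = sgd_step(text_prompt, grad_text, lr)
        image_prompt = sgd_step(image_prompt, grad_image, lr)
    return text_prompt, image_prompt


if __name__ == "__main__":
    text, image = multi_step_tuning([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [3.0, 3.0])
    print("text prompt:", text)
    print("image prompt:", image)
```

With these toy settings, the single stage moves each prompt close to its own target, and the alignment stage then draws the two prompts toward each other while each still tracks its own task. In a real VL model the quadratic losses would be replaced by the task and alignment objectives computed through a frozen Transformer backbone.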
The main contributions of the paper include:
1. **The first analysis of the robustness of prompt learning under missing modality data**, together with a new multi-step adaptive prompt learning method for handling missing modalities in VL models. The method addresses the limitations of existing work and improves the prompts through both autonomous and collaborative learning.
2. **A multi-step tuning strategy** covering single-stage and alignment-stage tuning. It adaptively generates visual and language prompts through multi-step modality alignment, so that knowledge is learned from both modalities without bias.
3. **Extensive experiments and ablation studies** that verify the effectiveness of MuAP on three benchmark datasets. The results show that the model outperforms current state-of-the-art methods on all of them.