Abstract: Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at https://github.com/Jian-Lang/RAGPT.

What problem does this paper attempt to address?

This paper addresses multimodal learning when some modalities are missing. Existing approaches (joint learning methods, cross-modal generation methods, and prompt-based methods) run into three problems on incomplete data:

1. **Limited Modal Information**: The remaining modalities usually provide limited cues for task-specific inference, especially when the missing modality carries key information.
2. **Noise Introduced by Dummy Imputation**: Incomplete inputs are usually filled with dummy values (such as empty strings or pixels), which loses information and introduces noise, reducing model performance.
3. **Static Prompts Lack Instance-awareness**: Static prompts share the same prompt tokens across all inputs, so they are instance-agnostic and cannot provide sufficient knowledge for instances with different missing conditions.

To address these problems, the paper proposes a new framework named RAGPT (Retrieval-AuGmented dynamic Prompt Tuning). RAGPT enhances the robustness of the pre-trained multimodal Transformer (MMT) through three modules:

- **Multi-channel Retriever**: Identifies similar instances through a within-modality retrieval strategy to obtain context related to the missing modality.
- **Missing Modality Generator**: Uses the retrieved context to recover the missing information, ensuring that the generated content matches the input format of the pre-trained MMT.
- **Context-aware Prompter**: Captures semantic associations from the related instances and generates dynamic prompts that adapt to each input, thereby improving the robustness of the model.

### Formula Representation

1. **Text Similarity Calculation**

\[ CR_i=\text{Top-K}_{r\in B}\left(\frac{(E^t_i)^{\top}E^t_r}{\|E^t_i\|\cdot\|E^t_r\|}\right) \]

where \(E^t_i\) and \(E^t_r\) are the text representations of the target instance and of an instance \(r\) in the memory bank \(B\) respectively, and \(CR_i\) is the set of the top-K most similar text instances.

2. **Cross-attention Mechanism to Generate Text-level Comprehensive Representation**

\[ \tilde{P}^t_i = \text{Att}(f_Q^t(W_i), f_K^t(W_R^i), f_V^t(W_R^i)) \]

\[ \text{Att}(Q, K, V)=\text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \]

where \(f_Q^t(\cdot), f_K^t(\cdot), f_V^t(\cdot)\) are the projection functions of query, key, and value respectively, and \(d\) is the embedding dimension.

3. **Frequency-domain Filtering**

\[ Z_i = F(\bar{W}_i) \]

\[ \tilde{Z}_i=W\odot Z_i \]

\[ \tilde{W}_i = F^{-1}(\tilde{Z}_i) \]

\[ \hat{W}_i=\text{LayerNorm}(\tilde{W}_i+\text{Dropout}(\tilde{W}_i)) \]

where \(F(\cdot)\) represents the one-dimensional fast Fourier transform (FFT), \(F^{-1}(\cdot)\) represents the inverse FFT, and \(\odot\) denotes element-wise multiplication by the filter \(W\).

### Summary

RAGPT addresses the bottlenecks that existing methods encounter on incomplete modal data by introducing a retrieval-augmented dynamic prompt tuning framework, improving the robustness of multimodal learning models when modalities are missing.
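The three formulas in the Formula Representation section can be illustrated with a small standard-library Python sketch. It is not the authors' implementation: the function names (`top_k_retrieve`, `attention`, `frequency_filter`) are hypothetical, the learned projections \(f_Q^t, f_K^t, f_V^t\) and the LayerNorm/Dropout step are left out, the filter is assumed to be a plain list of per-frequency weights, and a naive \(O(n^2)\) DFT stands in for the FFT. Vectors are assumed to be non-zero so that the cosine similarity is defined.

```python
import cmath
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_retrieve(query, bank, k):
    """Indices of the k memory-bank vectors most similar to `query`."""
    order = sorted(range(len(bank)),
                   key=lambda r: cosine(query, bank[r]),
                   reverse=True)
    return order[:k]


def attention(Q, K, V):
    """Att(Q, K, V) = Softmax(Q K^T / sqrt(d)) V, one row per token."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d)
                  for key in K]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        row = [sum(w / total * v[j] for w, v in zip(weights, V))
               for j in range(len(V[0]))]
        out.append(row)
    return out


def dft(x, inverse=False):
    """Naive discrete Fourier transform of a 1-D sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    y = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n)
             for t in range(n))
         for f in range(n)]
    return [v / n for v in y] if inverse else y


def frequency_filter(x, w):
    """Real part of F^{-1}(w * F(x)) for one feature channel."""
    z = dft(x)
    z_filtered = [wi * zi for wi, zi in zip(w, z)]
    # Keeping only the real part is a simplification of this sketch.
    return [v.real for v in dft(z_filtered, inverse=True)]
```

As a usage example, `top_k_retrieve([1.0, 0.0], bank, 2)` ranks the rows of `bank` by cosine similarity to the query, and `frequency_filter(x, [1, 0, 0, 0])` keeps only the zero-frequency component of a length-4 sequence, which replaces every entry with the sequence mean.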

Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning

Multimodal Prompting with Missing Modalities for Visual Recognition

UMP: Unified Modality-aware Prompt Tuning for Text-Video Retrieval

Towards Robust Multimodal Prompting With Missing Modalities

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality

Progressive Multi-modal Conditional Prompt Tuning

Conditional Prompt Tuning for Multimodal Fusion

MPT: Multi-grained Prompt Tuning for Text-Video Retrieval

Dynamic Prompting: A Unified Framework for Prompt Tuning

Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition

ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models

Deep Correlated Prompting for Visual Recognition with Missing Modalities

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Modal-aware Prompt Tuning with Deep Adaptive Feature Enhancement

Multitask Vision-Language Prompt Tuning

Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning

Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Prompt Tuning for Generative Multimodal Pretrained Models

Multi-Source Augmentation and Composite Prompts for Visual Recognition with Missing Modality

Unified Vision and Language Prompt Learning