Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning

Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou
2025-01-02
Abstract: Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at https://github.com/Jian-Lang/RAGPT.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses multimodal learning when the data of some modalities are missing. Existing approaches (joint learning methods, cross-modal generation methods, and prompt-based methods) run into three problems with such incomplete inputs:

1. **Limited Modal Information**: The remaining modalities usually provide only limited cues for the task, especially when the missing modality carries the key information.
2. **Noise Introduced by Dummy Imputation**: Missing inputs are usually filled with dummy values (such as empty strings or empty pixels), which loses information and introduces noise, reducing model performance.
3. **Static Prompts Lack Instance-awareness**: Static prompts share the same prompt tokens across all inputs, so they are instance-agnostic and cannot provide sufficient knowledge for instances with different missing-modality conditions.

To address these problems, the paper proposes RAGPT (Retrieval-AuGmented dynamic Prompt Tuning), a framework that makes a pre-trained multimodal Transformer (MMT) more robust through three modules:

- **Multi-channel Retriever**: Identifies similar instances through a within-modality retrieval strategy to obtain context related to the missing modality.
- **Missing Modality Generator**: Uses the retrieved context to recover the missing information, ensuring that the generated content matches the input format of the pre-trained MMT.
- **Context-aware Prompter**: Captures semantic associations from the relevant instances and generates dynamic prompts that adapt to each input.

### Formula Representation

1. **Text Similarity Calculation**

   \[ CR_i=\text{Top-}K_{r\in B}\left(\frac{(E^t_i)^{\top}E^t_r}{\|E^t_i\|\cdot\|E^t_r\|}\right) \]

   where \(E^t_i\) and \(E^t_r\) are the text representations of the target instance and of an instance \(r\) in the memory bank \(B\), respectively, and \(CR_i\) is the set of the \(K\) instances whose text is most similar to the target under cosine similarity.

2. **Cross-attention Mechanism to Generate a Text-level Comprehensive Representation**

   \[ \tilde{P}^t_i = \text{Att}\left(f_Q^t(W_i), f_K^t(W_R^i), f_V^t(W_R^i)\right) \]

   \[ \text{Att}(Q, K, V)=\text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \]

   where \(f_Q^t(\cdot)\), \(f_K^t(\cdot)\), and \(f_V^t(\cdot)\) are the query, key, and value projection functions, respectively, and \(d\) is the embedding dimension.

3. **Frequency-domain Filtering**

   \[ Z_i = F(\bar{W}_i) \]

   \[ \tilde{Z}_i=W\odot Z_i \]

   \[ \tilde{W}_i = F^{-1}(\tilde{Z}_i) \]

   \[ \hat{W}_i=\text{LayerNorm}\left(\tilde{W}_i+\text{Dropout}(\tilde{W}_i)\right) \]

   where \(F(\cdot)\) is the one-dimensional fast Fourier transform (FFT), \(F^{-1}(\cdot)\) is the inverse FFT, and \(\odot\) is the element-wise product.

### Summary

RAGPT targets the bottlenecks that existing methods face with incomplete modal data by combining retrieval, missing-modality generation, and dynamic prompt tuning in one framework. According to the abstract, it consistently outperforms the competitive baselines on three real-world datasets when modalities are missing.
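As a rough illustration of the three formulas above, the following NumPy sketch implements cosine Top-K retrieval, scaled dot-product cross-attention, and FFT-based filtering. It is not the authors' implementation (that lives in the linked repository). The function names, the use of plain linear projection matrices `Wq`, `Wk`, `Wv` for \(f_Q^t, f_K^t, f_V^t\), the choice of the sequence axis for the FFT, a LayerNorm without learnable scale and shift, and dropout treated as the identity (inference mode) are all assumptions made for the example.

```python
import numpy as np


def top_k_cosine(query, bank, k):
    """Return indices and scores of the k bank rows most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                       # cosine similarity to every bank entry
    order = np.argsort(-sims)[:k]      # highest similarity first
    return order, sims[order]


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(w_i, w_r, Wq, Wk, Wv):
    """Att(Q, K, V) with queries from the target and keys/values from retrieved rows.

    Wq, Wk, Wv are hypothetical linear projections standing in for f_Q, f_K, f_V.
    """
    q, k, v = w_i @ Wq, w_r @ Wk, w_r @ Wv
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learnable scale/shift (assumption)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def frequency_filter(w_bar, filt):
    """Filter w_bar (seq_len, dim) in the frequency domain, then normalize.

    filt has shape (seq_len // 2 + 1, dim) and multiplies the spectrum element-wise.
    The FFT runs along the sequence axis (assumption); dropout is the identity here.
    """
    n = w_bar.shape[0]
    z = np.fft.rfft(w_bar, axis=0)                # Z = F(W_bar)
    z_tilde = filt * z                            # Z~ = W ⊙ Z
    w_tilde = np.fft.irfft(z_tilde, n=n, axis=0)  # W~ = F^{-1}(Z~)
    return layer_norm(w_tilde + w_tilde)          # LayerNorm(W~ + Dropout(W~)), as written above
```

In this sketch, `top_k_cosine` would be called once per target instance against the memory-bank embeddings, and the rows it selects would supply `w_r` for `cross_attention`. Using `rfft`/`irfft` keeps the filtered output real-valued for real inputs without discarding an imaginary part.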