Abstract: In multimodal learning for visual recognition, missing modalities are a common issue that can significantly degrade the performance and robustness of vision-language models. Most existing approaches consider only the situation where a single modality (either image or text) is missing and use a data augmentation method to recover the missing data. However, in practice either the text or the image may be missing, and a data augmentation method that is effective for one modality might not suit the other, so text and image data require distinct augmentation methods. Other approaches aim to make vision-language models more robust to missing inputs. However, because most of them involve significant modifications to complex model structures and require extensive retraining, they are impractical when computational resources are limited. To address these limitations, we develop a Multi-source Augmentation and Composite Prompts (MACP) method that alleviates the performance degradation caused by missing modalities at both the data and model levels. At the data level, we design a multi-source data augmentation framework that integrates different data augmentation methods with a data selector to restore the missing data for each image-text sample as well as possible. At the model level, we design a method for generating prompt vectors that simultaneously indicate the missing modalities in the model input and the source of the augmentation data. Through prompt tuning, these prompts enhance the ability of the vision-language model to handle different input types in low-resource settings. Experimental results on three vision-language datasets demonstrate the effectiveness of our approach in mitigating the impact of missing modalities. Code is available.
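
The idea of a prompt that indicates both the missing modality and the augmentation source can be illustrated with a minimal sketch. It is not the authors' implementation: the label sets, the function names (`make_table`, `composite_prompt`, `prepend_prompt`), the concatenation scheme, and the use of plain Python lists in place of learnable tensors are all assumptions made for illustration.

```python
import random

# Hypothetical label sets; the abstract does not enumerate them.
MISSING_TYPES = ["complete", "missing_text", "missing_image"]
AUG_SOURCES = ["none", "source_a", "source_b"]


def make_table(keys, prompt_len, dim, seed):
    """Create one (prompt_len x dim) prompt matrix per key, randomly initialised.

    In an actual prompt-tuning setup these would be learnable parameters
    trained while the vision-language backbone stays frozen.
    """
    rng = random.Random(seed)
    return {
        key: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(prompt_len)]
        for key in keys
    }


def composite_prompt(missing_table, source_table, missing_type, aug_source):
    """Assumed composition: concatenate both prompt matrices along the token axis."""
    return missing_table[missing_type] + source_table[aug_source]


def prepend_prompt(prompt, tokens):
    """Place the prompt vectors in front of the input token embeddings."""
    return prompt + tokens


missing_table = make_table(MISSING_TYPES, prompt_len=2, dim=4, seed=0)
source_table = make_table(AUG_SOURCES, prompt_len=2, dim=4, seed=1)

# A sample whose text was missing and was restored by hypothetical "source_a".
prompt = composite_prompt(missing_table, source_table, "missing_text", "source_a")
tokens = [[0.0] * 4 for _ in range(3)]  # stand-in for three token embeddings
model_input = prepend_prompt(prompt, tokens)
```

Each (missing type, augmentation source) pair selects its own prompt here, so the frozen model receives an explicit signal about what kind of input it is seeing. Whether MACP composes the two signals by concatenation, addition, or another mechanism is not stated in the abstract.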

Language-Guided Visual Prompt Compensation for Multi-Modal Remote Sensing Image Classification with Modality Absence

Deep Correlated Prompting for Visual Recognition with Missing Modalities

Multimodal Prompting with Missing Modalities for Visual Recognition

Multi-Source Augmentation and Composite Prompts for Visual Recognition with Missing Modality

Modal-aware Visual Prompting for Incomplete Multi-modal Brain Tumor Segmentation

MSH-Net: Modality-Shared Hallucination With Joint Adaptation Distillation for Remote Sensing Image Classification Using Missing Modalities

PromptCD: Coupled and Decoupled Prompt Learning for Vision-Language Models

Mutual Prompt Leaning for Vision Language Models

Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models

MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality

Multi-modal Attribute Prompting for Vision-Language Models

Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization

Towards Robust Multimodal Prompting With Missing Modalities

Visual Prompt Flexible-Modal Face Anti-Spoofing

VTPL: Visual and Text Prompt Learning for Visual-Language Models

Progressive Multi-modal Conditional Prompt Tuning

Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering

TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt

Diversity-Guided Distillation with Modality-Center Regularization for Robust Multimodal Remote Sensing Image Classification

Visual Prompt Multi-Modal Tracking

Joint Classification of Hyperspectral Image and LiDAR Data Based on Spectral Prompt Tuning