Multi-Source Augmentation and Composite Prompts for Visual Recognition with Missing Modality

Li Kuang, Qi Xie, Yulu Zhou, Zhirui Kuai
DOI: https://doi.org/10.1145/3652583.3658105
2024-05-30
Abstract: In multimodal learning for visual recognition, missing modality is a common issue that can significantly degrade the performance and robustness of vision-language models. Most existing approaches consider only the situation where a single modality (either image or text) is missing, and use one data augmentation method to recover the missing data. In reality, however, either text or image may be missing, and a data augmentation method that is effective for one modality might not be suitable for the other, so text and image data require distinct augmentation methods. Other approaches aim to make vision-language models more robust to missing inputs. However, since most of them involve significant modifications to complex model structures and require extensive retraining, they are impractical when computational resources are limited. To address these limitations, we develop a Multi-source Augmentation and Composite Prompts method (MACP) that alleviates the performance degradation caused by missing modalities at both the data level and the model level. At the data level, we design a multi-source data augmentation framework that integrates different data augmentation methods with a data selector to restore the missing data for each image-text sample as well as possible. At the model level, we design a method for generating prompt vectors that simultaneously indicate the missing modalities in the model input and the source of the augmentation data. Through prompt tuning, these prompts enhance the ability of the vision-language model to handle different input types in low-resource situations. Experimental results on three vision-language datasets demonstrate the effectiveness of our approach in mitigating the impact of missing modalities. Code is available.
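The two levels described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the source names (`retrieval`, `generation`), the selector scores, the prompt dimension, and the choice to build a composite prompt by concatenating a missing-modality vector with an augmentation-source vector are all assumptions made for illustration only.

```python
import random
from typing import Dict, List, Optional, Tuple

# The three input cases a sample can fall into (labels are illustrative).
MISSING_TYPES = ("complete", "missing_text", "missing_image")


def missing_type(image: Optional[str], text: Optional[str]) -> str:
    """Classify which modality of an image-text sample is absent."""
    if image is None and text is None:
        raise ValueError("at least one modality must be present")
    if image is None:
        return "missing_image"
    if text is None:
        return "missing_text"
    return "complete"


def select_augmentation(candidates: Dict[str, Tuple[str, float]]) -> Tuple[str, str]:
    """Return (source_name, restored_data) for the highest-scoring candidate.

    `candidates` maps a hypothetical augmentation source name to a pair of
    (restored data, selector score); how scores are computed is not modelled.
    """
    source = max(candidates, key=lambda name: candidates[name][1])
    return source, candidates[source][0]


class CompositePrompts:
    """One vector per missing type and one per augmentation source.

    In prompt tuning these vectors would be the trainable parameters while
    the vision-language model stays frozen; here they are plain float lists.
    """

    def __init__(self, sources: List[str], dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def new_vector() -> List[float]:
            return [rng.gauss(0.0, 0.02) for _ in range(dim)]

        self.missing = {m: new_vector() for m in MISSING_TYPES}
        # "none" covers complete samples that needed no augmentation.
        self.source = {s: new_vector() for s in ["none", *sources]}

    def lookup(self, missing: str, source: str) -> List[float]:
        # Assumption: the composite prompt is the concatenation of both parts.
        return self.missing[missing] + self.source[source]


# Usage: a sample whose text is missing, restored from two hypothetical sources.
prompts = CompositePrompts(["retrieval", "generation"], dim=4)
kind = missing_type(image="cat.jpg", text=None)
source, restored = select_augmentation({
    "retrieval": ("a photo of a cat", 0.71),
    "generation": ("a cat sitting", 0.64),
})
vector = prompts.lookup(kind, source)
```

In this sketch the prompt is chosen by the pair (missing type, augmentation source), so the model can in principle distinguish genuine inputs from restored ones and tell restored inputs from different sources apart.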
Computer Science