Multimodal Prompting with Missing Modalities for Visual Recognition

Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, Chen-Yu Lee
2023-03-10
Abstract: In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) when missing-modality occurs either during training or testing in real-world situations; and 2) when the computation resources are not available to finetune on heavy transformer models. To this end, we propose to utilize prompt learning and mitigate the above two challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while only requiring less than 1% learnable parameters compared to training the entire model. We further explore the effect of different prompt configurations and analyze the robustness to missing modality. Extensive experiments are conducted to show the effectiveness of our prompt learning framework that improves the performance under various missing-modality cases, while alleviating the requirement of heavy model re-training. Code is available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two major challenges in multimodal visual recognition:

1. **Missing-modality problem**: In practical applications, one or more modalities may be missing during training, testing, or both. For example, in a dataset of images paired with text, some samples may lack the text description and others may lack the image.
2. **Computational resource limitations**: Multimodal Transformer models are usually very large (possibly containing billions of parameters), so fine-tuning them requires a large amount of computation, which is not feasible in many practical scenarios.

To address both problems, the authors propose a method based on prompt learning. "Modality-missing-aware prompts" are plugged into a pre-trained multimodal Transformer to handle missing modalities, and the learnable parameters amount to less than 1% of those required to train the entire model, which avoids expensive full fine-tuning.

### Main contributions

- **Proposing a general scenario**: The missing-modality situation can differ from one data sample to the next, and it can occur during the training phase, the testing phase, or both.
- **Designing modality-missing-aware prompts**: The prompts are chosen according to which modalities are missing in the input, helping the pre-trained model handle missing modalities without fine-tuning the entire model.
- **Studying two prompt integration methods**: Prompts are added either at the input layer or at the attention layer. Experimental results show that input-layer prompts usually perform better, but on some datasets attention-layer prompts may be more stable.
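
The idea of choosing a prompt according to the missing-modality case and attaching it at the input layer can be illustrated with a minimal sketch. It is not the authors' implementation: the names `init_prompt_bank`, `missing_case`, and `attach_input_prompts`, the sizes `EMBED_DIM` and `PROMPT_LEN`, and the use of plain Python lists in place of tensors are all assumptions made for illustration, and the frozen backbone is left out entirely.

```python
import random

EMBED_DIM = 8    # hypothetical token embedding size
PROMPT_LEN = 2   # hypothetical number of prompt vectors per case

# One learnable prompt per missing-modality case (image + text setting).
CASES = ("complete", "missing_text", "missing_image")


def init_prompt_bank(seed=0):
    """Create one small prompt (PROMPT_LEN x EMBED_DIM) per case."""
    rng = random.Random(seed)
    return {
        case: [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
               for _ in range(PROMPT_LEN)]
        for case in CASES
    }


def missing_case(has_image, has_text):
    """Map the modalities present in a sample to a prompt-bank key."""
    if has_image and has_text:
        return "complete"
    if has_image:
        return "missing_text"
    if has_text:
        return "missing_image"
    raise ValueError("at least one modality must be present")


def attach_input_prompts(tokens, has_image, has_text, bank):
    """Input-level prompting: prepend the case-specific prompt to the tokens.

    `tokens` is the embedded input sequence (a list of EMBED_DIM vectors).
    The backbone would stay frozen; only the vectors in `bank` would be trained.
    """
    prompt = bank[missing_case(has_image, has_text)]
    return prompt + tokens


bank = init_prompt_bank()
# A sample whose text is missing: five placeholder image-patch embeddings.
image_tokens = [[0.0] * EMBED_DIM for _ in range(5)]
sequence = attach_input_prompts(image_tokens, has_image=True, has_text=False, bank=bank)
print(len(sequence))  # 7: two prompt vectors followed by five input tokens
```

Because only the entries of `bank` would receive gradient updates, the trainable parameter count in this sketch is `len(CASES) * PROMPT_LEN * EMBED_DIM`, independent of the backbone size. The attention-layer variant mentioned above would inject the selected prompt inside the attention layers instead of prepending it to the input sequence; it is not shown here.
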
### Experimental results

The authors conducted experiments on three multimodal downstream tasks: movie genre classification (MM-IMDb), food classification (UPMC Food-101), and hate speech detection (Hateful Memes). The results show that, compared with the baseline model, modality-missing-aware prompts significantly improve performance across various missing-modality situations, which is especially useful when computational resources are limited.

### Conclusion

By introducing modality-missing-aware prompts, this paper addresses both missing modalities and computational resource limitations in multimodal visual recognition, offering a practical route to wider use of multimodal learning in real applications.