Language-Guided Visual Prompt Compensation for Multi-Modal Remote Sensing Image Classification with Modality Absence

Ling Huang, Wenqian Dong, Song Xiao, Jiahui Qu, Yuanbo Yang, Yunsong Li
DOI: https://doi.org/10.1145/3664647.3681563
2024-01-01
Abstract: Joint classification of multi-modal remote sensing images has achieved great success thanks to the complementary advantages of multi-modal images. However, modality absence is a common problem in the real world, caused by imaging conditions, and it breaks most classification methods that rely on complete modalities. Existing approaches either learn shared representations or train a specific model for each absence case, so they struggle to balance the complementary advantages of the modalities against scalability across absence cases. In this paper, we propose a language-guided visual prompt compensation network (LVPCnet) that achieves joint classification under arbitrary modality absence with a unified model while still accounting for modality complementarity. It embeds missing modality-specific knowledge into visual prompts that guide the model to recover complete modal information from the available modalities for classification. Specifically, a language-guided visual feature decoupling stage (LVFD-stage) is designed to extract shared and modality-specific features from multi-modal images, establishing a complementary representation model of the complete modalities. Subsequently, an absence-aware visual prompt compensation stage (VPC-stage) is proposed to learn visual prompts containing missing modality-specific knowledge through cross-modal representation alignment; the learned prompts then guide the complementary representation model to reconstruct the modality-specific features of missing modalities from the available ones. The VPC-stage trains only the visual prompts to perceive missing information, without retraining the model, which allows the approach to scale to arbitrary modality-absence scenarios. Systematic experiments on three public datasets validate the effectiveness of the proposed approach.
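The scalability argument in the abstract can be illustrated with a small counting sketch. It is not the authors' implementation. It assumes, hypothetically, that one prompt is stored per modality and looked up whenever that modality is missing; the abstract does not specify how prompts are indexed. The modality names and the `prompt_bank` strings are placeholders rather than anything from the paper.

```python
from itertools import combinations


def absence_cases(modalities):
    """Enumerate every case where some, but not all, modalities are missing."""
    cases = []
    for k in range(1, len(modalities)):
        for missing in combinations(modalities, k):
            cases.append(frozenset(missing))
    return cases


def prompts_for_case(missing, prompt_bank):
    """Pick the stored prompt for each missing modality (hypothetical lookup)."""
    return {m: prompt_bank[m] for m in sorted(missing)}


# Assumed example modalities; the abstract does not name the ones used.
modalities = ["HSI", "LiDAR", "SAR"]

# Placeholder strings standing in for learned prompt tensors (assumption).
prompt_bank = {m: f"prompt[{m}]" for m in modalities}

cases = absence_cases(modalities)
print("absence cases:", len(cases))         # 2**M - 2 cases for M modalities
print("stored prompts:", len(prompt_bank))  # one per modality under this assumption
for case in cases:
    print(sorted(case), "->", prompts_for_case(case, prompt_bank))
```

Under this assumption, training one model per absence case needs 2^M - 2 models for M modalities, whereas the prompt bank grows linearly with M and the shared model is left unchanged.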