MMF-CLIP: An Image-Text Multimodal Semantic Segmentation Method for Remote Sensing Images

Xiqian Yang, Xili Wang, Xiyuan Wang
DOI: https://doi.org/10.1109/IGARSS53475.2024.10642152
2024-07-07
Abstract: The complexity of geospatial data and the variability of targets in remote sensing imagery make semantic segmentation challenging. To segment dense targets accurately, improve feature distinguishability, and alleviate sample imbalance, we propose MMF-CLIP, a multimodal model based on CLIP. It comprises a text encoder that uses CLIP pre-training to extract text features, an image encoder that uses SegFormer to extract image features, and a new decoder. The decoder fuses multiscale features with residual modules and channel attention mechanisms, aligns multimodal features through a high-resolution pixel-text score map, and enhances feature representation by fusing multimodal multiscale features. The decoder also generates attention weights with channel-spatial attention mechanisms to optimize the distribution of multimodal features and to mine complementary information from the two modalities. Experiments on publicly available remote sensing datasets show that the proposed method yields superior segmentation results and outperforms the latest comparable segmentation methods.
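The abstract names two decoder ingredients, a pixel-text score map and channel attention, without giving formulas. The NumPy sketch below illustrates one common way such components are built; it is not the authors' implementation. It assumes the score map is the cosine similarity between each pixel embedding and each class text embedding, and that channel attention follows a squeeze-and-excitation pattern. The function names, tensor shapes, and random weights are all hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, axis, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pixel_text_score_map(pixel_feats, text_feats):
    """Cosine similarity between every pixel and every class text embedding.

    pixel_feats: (C, H, W) image features (assumed already projected to the
                 same embedding width C as the text features).
    text_feats:  (K, C) one embedding per class prompt.
    Returns a (K, H, W) score map.
    """
    p = l2_normalize(pixel_feats, axis=0)
    t = l2_normalize(text_feats, axis=1)
    return np.einsum("kc,chw->khw", t, p)


def channel_attention(feats, w1, w2):
    """Squeeze-and-excitation style channel reweighting (an assumed design).

    feats: (C, H, W); w1: (C_hidden, C); w2: (C, C_hidden).
    """
    squeezed = feats.mean(axis=(1, 2))               # global average pool, (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)          # ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gates in (0, 1)
    return feats * weights[:, None, None]


# Toy sizes and random inputs, purely for illustration.
rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 3
pixel_feats = rng.standard_normal((C, H, W))
text_feats = rng.standard_normal((K, C))

score_map = pixel_text_score_map(pixel_feats, text_feats)

# One simple fusion choice: concatenate the score map with the image features
# along the channel axis, then reweight the channels.
fused = np.concatenate([pixel_feats, score_map], axis=0)   # (C + K, H, W)
hidden_dim = (C + K) // 2
w1 = rng.standard_normal((hidden_dim, C + K))
w2 = rng.standard_normal((C + K, hidden_dim))
reweighted = channel_attention(fused, w1, w2)
```

Concatenating the score map with the image features is only one possible fusion strategy; the abstract does not specify how MMF-CLIP combines them, so the fusion step here should be read as a placeholder.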
Computer Science, Environmental Science