Abstract: Image–text multimodal deep semantic segmentation fuses and aligns image and text information, which provides additional prior knowledge for segmentation tasks, and it is worth exploring for remote sensing images. In this paper, we propose a bidirectional feature fusion and enhanced alignment-based multimodal semantic segmentation model (BEMSeg) for remote sensing images. First, BEMSeg extracts image and text features with an image encoder and a text encoder, respectively, and passes them to the fusion and alignment stages to obtain a complementary multimodal feature representation. Second, a bidirectional feature fusion module employs self-attention and cross-attention to adaptively fuse the image and text features, thus reducing the differences between the two modalities. For multimodal feature alignment, the similarity between the image pixel features and the text features is computed to obtain a pixel–text score map. Third, we propose category-based pixel-level contrastive learning on the score map to reduce the differences among pixels of the same category and increase the differences among pixels of different categories, thereby enhancing the alignment. Additionally, we explore a strategy for selecting positive and negative samples across different images during contrastive learning: averaging pixel values over different training images for each category to form the positive and negative samples allows global pixel information to be compared while limiting the number of samples and reducing computational cost. Finally, the fused image features and the aligned pixel–text score map are concatenated and fed into the decoder to predict the segmentation results.
Experimental results on the ISPRS Potsdam, Vaihingen, and LoveDA datasets demonstrate that BEMSeg outperforms the comparison methods on the Potsdam and Vaihingen datasets, with mIoU improvements ranging from 0.57% to 5.59% and from 0.48% to 6.15%, respectively. Compared with Transformer-based methods, BEMSeg also performs competitively on the LoveDA dataset, with mIoU improvements ranging from 0.37% to 7.14%.
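
The alignment steps summarized above (pixel–text score map, per-category averaging, and category-based contrastive learning) can be illustrated with a minimal sketch. It is not the BEMSeg implementation: the abstract does not specify the similarity measure or the loss, so cosine similarity, an InfoNCE-style loss, the temperature `tau`, and all function names below are assumptions made for illustration only. Features are plain Python lists to keep the example self-contained.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def score_map(pixel_feats, text_feats):
    """Pixel-text score map: one row per pixel, one column per category text.

    Assumption: similarity is cosine similarity; the abstract only says
    that a similarity is computed.
    """
    return [[cosine(p, t) for t in text_feats] for p in pixel_feats]


def category_means(rows, labels, num_classes):
    """Average the score rows of all pixels that share a category label.

    The rows may be pooled from several training images, which yields one
    averaged sample per category instead of one sample per pixel.
    Categories with no pixels get None.
    """
    dim = len(rows[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for row, y in zip(rows, labels):
        counts[y] += 1
        for j, x in enumerate(row):
            sums[y][j] += x
    return [[s / c for s in srow] if c else None
            for srow, c in zip(sums, counts)]


def contrastive_loss(rows, labels, prototypes, tau=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Each pixel's score row is pulled toward the averaged sample of its own
    category (positive) and pushed away from those of other categories
    (negatives).
    """
    total = 0.0
    for row, y in zip(rows, labels):
        logits = {k: cosine(row, proto) / tau
                  for k, proto in enumerate(prototypes) if proto is not None}
        m = max(logits.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += log_denom - logits[y]
    return total / len(rows)


# Toy data: four "pixels" with 2-D features and two category text features.
pixel_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 0, 1, 1]

scores = score_map(pixel_feats, text_feats)
prototypes = category_means(scores, labels, num_classes=2)
loss = contrastive_loss(scores, labels, prototypes)
```

Because every pixel is compared with only one averaged sample per category, the number of comparisons per pixel equals the number of categories rather than the number of pixels, which is the cost saving the abstract describes.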

MMF-CLIP: An Image-Text Multimodal Semantic Segmentation Method for Remote Sensing Images

CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Bidirectional Feature Fusion and Enhanced Alignment based Multimodal Semantic Segmentation for Remote Sensing Images

Text4Seg: Reimagining Image Segmentation as Text Generation

MMSMCNet: Modal Memory Sharing and Morphological Complementary Networks for RGB-T Urban Scene Semantic Segmentation

Object Segmentation by Mining Cross-Modal Semantics

Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models

Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

DiffCLIP: Few-shot Language-driven Multimodal Classifier

A semantic segmentation method for remote sensing images based on multiple contextual feature extraction

ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images

CIMFNet: Cross-layer Interaction and Multiscale Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Images

MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning