Abstract: Accurate semantic segmentation of remote sensing data plays a crucial role in the success of geoscience research and applications. Recently, multimodal fusion-based segmentation models have attracted much attention due to their outstanding performance compared with conventional single-modal techniques. However, most of these models perform their fusion operation using convolutional neural networks (CNNs) or the vision transformer (ViT), resulting in insufficient local–global contextual modeling and representational capabilities. In this work, a multilevel multimodal fusion scheme called FTransUNet is proposed to provide a robust and effective multimodal fusion backbone for semantic segmentation by integrating both CNN and ViT into one unified fusion framework. First, the shallow-level features are extracted and fused through convolutional layers and shallow-level feature fusion (SFF) modules. After that, deep-level features characterizing semantic information and spatial relationships are extracted and fused by a well-designed fusion ViT (FVit). The FVit applies adaptively mutually boosted attention (Ada-MBA) layers and self-attention (SA) layers alternately in a three-stage scheme to learn cross-modality representations with high interclass separability and low intraclass variation. Specifically, the proposed Ada-MBA computes SA and cross-attention (CA) in parallel to enhance intra- and cross-modality contextual information simultaneously while steering the attention distribution toward semantic-aware regions. As a result, FTransUNet can fuse shallow-level and deep-level features in a multilevel manner, taking full advantage of the CNN and the transformer to accurately characterize local details and global semantics, respectively. Extensive experiments confirm the superior performance of the proposed FTransUNet compared with other multimodal fusion approaches on two fine-resolution remote sensing datasets, namely ISPRS Vaihingen and Potsdam.
The source code for this work is available at https://github.com/sstary/SSRS.
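
To make the idea of computing SA and CA in parallel concrete, the following is a minimal NumPy sketch, not the authors' Ada-MBA implementation (which is in the repository linked above). It assumes two modalities already embedded as token matrices of the same shape, omits the learned query/key/value projections and multi-head structure, and uses a fixed mixing weight `alpha` as a stand-in for whatever adaptive weighting the real layer learns. The names `attention`, `parallel_sa_ca`, `alpha`, `modality_a`, and `modality_b` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for token matrices of shape (n_tokens, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def parallel_sa_ca(x, y, alpha=0.5):
    """Toy fusion step that runs self- and cross-attention side by side.

    x, y  : (n_tokens, dim) tokens from two modalities.
    alpha : hypothetical mixing weight; fixed here, learnable in a real model.
    """
    # Self-attention: each modality attends to its own tokens.
    sa_x = attention(x, x, x)
    sa_y = attention(y, y, y)
    # Cross-attention: queries from one modality, keys/values from the other.
    ca_x = attention(x, y, y)
    ca_y = attention(y, x, x)
    # Residual connection plus a weighted mix of the two parallel branches.
    out_x = x + alpha * sa_x + (1.0 - alpha) * ca_x
    out_y = y + alpha * sa_y + (1.0 - alpha) * ca_y
    return out_x, out_y


rng = np.random.default_rng(0)
modality_a = rng.standard_normal((16, 8))  # 16 tokens, 8 channels (arbitrary sizes)
modality_b = rng.standard_normal((16, 8))
fused_a, fused_b = parallel_sa_ca(modality_a, modality_b)
print(fused_a.shape, fused_b.shape)
```

Setting `alpha=1.0` reduces each branch to plain residual self-attention, which is a convenient sanity check when experimenting with the mixing weight.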

MMT: Mixed-Mask Transformer for Remote Sensing Image Semantic Segmentation

Mixed Transformer U-Net for Medical Image Segmentation

Maskformer with Improved Encoder-Decoder Module for Semantic Segmentation of Fine-Resolution Remote Sensing Images

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

S-MAT: Semantic-Driven Masked Attention Transformer for Multi-Label Aerial Image Classification

Masked-attention Mask Transformer for Universal Image Segmentation

MCAT-UNet: Convolutional and Cross-Shaped Window Attention Enhanced UNet for Efficient High-Resolution Remote Sensing Image Segmentation

MST: Masked Self-Supervised Transformer for Visual Representation

Masked Topology Convolutional Network for Classification and Segmentation of Remote Sensing Images

MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images

Masked Feature Modeling for Generative Self-Supervised Representation Learning of High-Resolution Remote Sensing Images

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

MSMT-LCL: Multiscale Spatial-Spectral Masked Transformer With Local Contrastive Learning for Hyperspectral Image Classification

Learning Content-enhanced Mask Transformer for Domain Generalized Urban-Scene Segmentation

Memory-Augmented Transformer for Remote Sensing Image Semantic Segmentation

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation

MMFormer: Multimodal Transformer Using Multiscale Self-Attention for Remote Sensing Image Classification

Efficient Transformer for Remote Sensing Image Segmentation

Multi-Swin Mask Transformer for Instance Segmentation of Agricultural Field Extraction

When zero-padding position encoding encounters linear space reduction attention: an efficient semantic segmentation Transformer of remote sensing images

MDMASNet: A dual-task interactive semi-supervised remote sensing image segmentation method