Abstract:Accurate semantic segmentation of remote sensing data plays a crucial role in the success of geoscience research and applications. Recently, multimodal fusion-based segmentation models have attracted much attention due to their outstanding performance as compared to conventional single-modal techniques. However, most of these models perform their fusion operation using convolutional neural networks (CNNs) or the vision transformer (Vit), resulting in insufficient local–global contextual modeling and representative capabilities. In this work, a multilevel multimodal fusion scheme called FTransUNet is proposed to provide a robust and effective multimodal fusion backbone for semantic segmentation by integrating both CNN and Vit into one unified fusion framework. First, the shallow-level features are first extracted and fused through convolutional layers and shallow-level feature fusion (SFF) modules. After that, deep-level features characterizing semantic information and spatial relationships are extracted and fused by a well-designed fusion Vit (FVit). It applies adaptively mutually boosted attention (Ada-MBA) layers and self-attention (SA) layers alternately in a three-stage scheme to learn cross-modality representations of high interclass separability and low intraclass variations. Specifically, the proposed Ada-MBA computes SA and cross-attention (CA) in parallel to enhance intra- and cross-modality contextual information simultaneously while steering attention distribution toward semantic-aware regions. As a result, FTransUNet can fuse shallow-level and deep-level features in a multilevel manner, taking full advantage of CNN and transformer to accurately characterize local details and global semantics, respectively. Extensive experiments confirm the superior performance of the proposed FTransUNet compared with other multimodal fusion approaches on two fine-resolution remote sensing datasets, namely ISPRS Vaihingen and Potsdam. The source code in this work is available at https://github.com/sstary/SSRS.

A Remote Sensing Image Segmentation Model Based on Multi-Scale Feature Fusion

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation.

A ViT-Based Multiscale Feature Fusion Approach for Remote Sensing Image Segmentation

An Image Segmentation Method Based on Transformer and Multi-Scale Feature Fusion for UAV Marine Environment Monitoring

Remote Sensing Scene Image Classification Model Based on Multi-Scale Features and Attention Mechanism

Remote Sensing Image Segmentation Using Vision Mamba and Multi-Scale Multi-Frequency Feature Fusion

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

Multi-Scale Feature Fusion Based on PVTv2 for Deep Hash Remote Sensing Image Retrieval

Multi-View Feature Fusion and Rich Information Refinement Network for Semantic Segmentation of Remote Sensing Images

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation

MCAFNet: A Multiscale Channel Attention Fusion Network for Semantic Segmentation of Remote Sensing Images

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

CNN and Transformer Fusion for Remote Sensing Image Semantic Segmentation

MFVNet: a deep adaptive fusion network with multiple field-of-views for remote sensing image semantic segmentation

Remote Sensing Image Semantic Segmentation Method Based on a Deep Convolutional Neural Network and Multiscale Feature Fusion

Multi-scale Feature Extraction and Fusion Net: Research on UAVs Image Semantic Segmentation Technology

Multi-scale attention fusion network for semantic segmentation of remote sensing images

MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images

Encoder- and Decoder-Based Networks Using Multiscale Feature Fusion and Nonlocal Block for Remote Sensing Image Semantic Segmentation

Advancing Plain Vision Transformer Toward Remote Sensing Foundation Model