Abstract: Accurate semantic segmentation of remote sensing data plays a crucial role in the success of geoscience research and applications. Recently, multimodal fusion-based segmentation models have attracted much attention due to their outstanding performance compared with conventional single-modal techniques. However, most of these models perform their fusion operation using convolutional neural networks (CNNs) or the vision transformer (Vit), resulting in insufficient local–global contextual modeling and representative capabilities. In this work, a multilevel multimodal fusion scheme called FTransUNet is proposed to provide a robust and effective multimodal fusion backbone for semantic segmentation by integrating both CNN and Vit into one unified fusion framework. First, the shallow-level features are extracted and fused through convolutional layers and shallow-level feature fusion (SFF) modules. After that, deep-level features characterizing semantic information and spatial relationships are extracted and fused by a well-designed fusion Vit (FVit). The FVit applies adaptively mutually boosted attention (Ada-MBA) layers and self-attention (SA) layers alternately in a three-stage scheme to learn cross-modality representations of high interclass separability and low intraclass variations. Specifically, the proposed Ada-MBA computes SA and cross-attention (CA) in parallel to enhance intra- and cross-modality contextual information simultaneously while steering attention distribution toward semantic-aware regions. As a result, FTransUNet can fuse shallow-level and deep-level features in a multilevel manner, taking full advantage of CNN and transformer to accurately characterize local details and global semantics, respectively. Extensive experiments confirm the superior performance of the proposed FTransUNet compared with other multimodal fusion approaches on two fine-resolution remote sensing datasets, namely ISPRS Vaihingen and Potsdam.
The source code for this work is available at https://github.com/sstary/SSRS.
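
The idea of computing SA and CA in parallel can be illustrated with a minimal sketch. This is not the Ada-MBA layer from the paper or the linked repository: the function names, the fixed mixing weight `alpha`, and the absence of learned query/key/value projections are all assumptions made for illustration. The abstract does not specify how the two attention outputs are combined, so a simple weighted sum is used here.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), values)


def parallel_sa_ca(x, y, alpha=0.5):
    """Hypothetical mix of SA on modality x and CA from x to modality y.

    alpha is an assumed fixed weight; a real layer would learn how to
    balance the two branches and would use learned projections.
    """
    sa = attention(x, x, x)  # intra-modality context
    ca = attention(x, y, y)  # cross-modality context: x queries y
    return [
        [(1 - alpha) * s + alpha * c for s, c in zip(row_sa, row_ca)]
        for row_sa, row_ca in zip(sa, ca)
    ]


# Two tokens from one modality, three tokens from another (toy values).
x_tokens = [[1.0, 0.0], [0.0, 1.0]]
y_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = parallel_sa_ca(x_tokens, y_tokens, alpha=0.5)
```

Because both branches share the same queries, each output token keeps the shape of the first modality while drawing context from both; setting `alpha` to 0 or 1 recovers pure SA or pure CA.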

Transfer Representation Learning Meets Multimodal Fusion Classification for Remote Sensing Images

Fusing Deep Features by Kernel Collaborative Representation for Remote Sensing Scene Classification

Novel Cross-Resolution Feature-Level Fusion for Joint Classification of Multispectral and Panchromatic Remote Sensing Images

Intra- and Intersource Interactive Representation Learning Network for Remote Sensing Images Classification

Representation-Enhanced Status Replay Network for Multisource Remote-Sensing Image Classification

Multi-Scale Spectral-Spatial Attention Residual Fusion Network for Multi-Source Remote Sensing Data Classification

A multimodal hyper-fusion transformer for remote sensing image classification

A Unified Multimodal Deep Learning Framework for Remote Sensing Imagery Classification

A Spatial-Channel Progressive Fusion ResNet for Remote Sensing Classification

Learning transferable cross-modality representations for few-shot hyperspectral and LiDAR collaborative classification

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

Adaptive Multiscale Deep Fusion Residual Network for Remote Sensing Image Classification

MCFT: Multimodal Contrastive Fusion Transformer for Classification of Hyperspectral Image and LiDAR Data

More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation

A Novel Adaptive Hybrid Fusion Network for Multiresolution Remote Sensing Images Classification

Deep Symmetric Fusion Transformer for Multimodal Remote Sensing Data Classification

TCPSNet: Transformer and Cross-Pseudo-Siamese Learning Network for Classification of Multi-Source Remote Sensing Images

A Novel MRF-Based Multifeature Fusion for Classification of Remote Sensing Images

Multisource Remote Sensing Data Classification with Graph Fusion Network