Abstract: Accurate semantic segmentation of remote sensing data plays a crucial role in the success of geoscience research and applications. Recently, multimodal fusion-based segmentation models have attracted much attention because they outperform conventional single-modal techniques. However, most of these models perform their fusion operation using either convolutional neural networks (CNNs) or the vision transformer (ViT), resulting in insufficient local–global contextual modeling and representation capabilities. In this work, a multilevel multimodal fusion scheme called FTransUNet is proposed to provide a robust and effective multimodal fusion backbone for semantic segmentation by integrating both CNN and ViT into one unified fusion framework. First, shallow-level features are extracted and fused through convolutional layers and shallow-level feature fusion (SFF) modules. After that, deep-level features characterizing semantic information and spatial relationships are extracted and fused by a well-designed fusion ViT (FVit). FVit applies adaptively mutually boosted attention (Ada-MBA) layers and self-attention (SA) layers alternately in a three-stage scheme to learn cross-modality representations with high interclass separability and low intraclass variation. Specifically, the proposed Ada-MBA computes SA and cross-attention (CA) in parallel to enhance intra- and cross-modality contextual information simultaneously while steering the attention distribution toward semantic-aware regions. As a result, FTransUNet can fuse shallow-level and deep-level features in a multilevel manner, taking full advantage of the CNN and the transformer to accurately characterize local details and global semantics, respectively. Extensive experiments confirm the superior performance of the proposed FTransUNet compared with other multimodal fusion approaches on two fine-resolution remote sensing datasets, namely ISPRS Vaihingen and Potsdam.
The source code for this work is available at https://github.com/sstary/SSRS.
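
The idea of computing SA and CA in parallel for two modalities, as the abstract describes for Ada-MBA, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`attention`, `parallel_sa_ca`), the fixed scalar mixing weight `alpha`, and the omission of learned query/key/value projections are all assumptions made to keep the example short. In particular, how Ada-MBA actually weights the two branches is not specified in the abstract.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for 2-D inputs of shape (tokens, dim).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def parallel_sa_ca(x_a, x_b, alpha=0.5):
    # Hypothetical fusion step (assumption, not the paper's exact layer):
    # each modality attends to itself (SA) and to the other modality (CA),
    # and the two results are mixed with a scalar weight alpha.
    sa_a = attention(x_a, x_a, x_a)  # intra-modality context for A
    ca_a = attention(x_a, x_b, x_b)  # cross-modality context: A queries B
    sa_b = attention(x_b, x_b, x_b)  # intra-modality context for B
    ca_b = attention(x_b, x_a, x_a)  # cross-modality context: B queries A
    out_a = alpha * sa_a + (1.0 - alpha) * ca_a
    out_b = alpha * sa_b + (1.0 - alpha) * ca_b
    return out_a, out_b

# Toy inputs: 4 tokens of dimension 8 per modality (random placeholders).
rng = np.random.default_rng(0)
tokens_a = rng.standard_normal((4, 8))
tokens_b = rng.standard_normal((4, 8))
fused_a, fused_b = parallel_sa_ca(tokens_a, tokens_b)
```

Setting `alpha=1.0` reduces the sketch to plain self-attention per modality, and `alpha=0.0` to pure cross-attention, which makes the role of each branch easy to inspect in isolation.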

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

Multiscale 3-D-2-D Mixed CNN and Lightweight Attention-Free Transformer for Hyperspectral and LiDAR Classification

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

A Free Lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

TransFG: A Transformer Architecture for Fine-Grained Recognition

Multi-level information fusion Transformer with background filter for fine-grained image recognition

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

Multimodal Fusion Transformer for Remote Sensing Image Classification

SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization

FET-FGVC: Feature-enhanced transformer for fine-grained visual classification

Multimodal Token Fusion for Vision Transformers

Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers

AA-Trans: Core Attention Aggregating Transformer with Information Entropy Selector for Fine-grained Visual Classification

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation

A multimodal hyper-fusion transformer for remote sensing image classification

MFST: Multi-Modal Feature Self-Adaptive Transformer for Infrared and Visible Image Fusion

MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation

MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification