Abstract: Accurate semantic segmentation of remote sensing data plays a crucial role in the success of geoscience research and applications. Recently, multimodal fusion-based segmentation models have attracted much attention because they outperform conventional single-modal techniques. However, most of these models perform their fusion operation using either convolutional neural networks (CNNs) or the vision transformer (Vit), resulting in insufficient local–global contextual modeling and representational capabilities. In this work, a multilevel multimodal fusion scheme called FTransUNet is proposed to provide a robust and effective multimodal fusion backbone for semantic segmentation by integrating both CNN and Vit into one unified fusion framework. First, shallow-level features are extracted and fused through convolutional layers and shallow-level feature fusion (SFF) modules. After that, deep-level features characterizing semantic information and spatial relationships are extracted and fused by a well-designed fusion Vit (FVit). The FVit applies adaptively mutually boosted attention (Ada-MBA) layers and self-attention (SA) layers alternately in a three-stage scheme to learn cross-modality representations with high interclass separability and low intraclass variation. Specifically, the proposed Ada-MBA computes SA and cross-attention (CA) in parallel to enhance intra- and cross-modality contextual information simultaneously while steering the attention distribution toward semantic-aware regions. As a result, FTransUNet can fuse shallow-level and deep-level features in a multilevel manner, taking full advantage of the CNN and the transformer to accurately characterize local details and global semantics, respectively. Extensive experiments confirm the superior performance of the proposed FTransUNet compared with other multimodal fusion approaches on two fine-resolution remote sensing datasets, namely ISPRS Vaihingen and Potsdam.
The source code for this work is available at https://github.com/sstary/SSRS.
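
The idea of computing SA and CA in parallel, as described in the abstract, can be illustrated with a minimal pure-Python sketch. This is not the authors' Ada-MBA implementation: it assumes identity query/key/value projections, a single attention head, and a fixed mixing weight `alpha` (a hypothetical parameter standing in for whatever adaptive weighting the paper learns). The function names `attention` and `parallel_sa_ca` are likewise illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output dimension at a time.
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def parallel_sa_ca(x, y, alpha=0.5):
    """Blend self-attention and cross-attention for modality x.

    SA: tokens of x attend to x (intra-modality context).
    CA: tokens of x attend to y (cross-modality context).
    alpha is an assumed fixed mixing weight, not a value from the paper.
    """
    sa = attention(x, x, x)
    ca = attention(x, y, y)
    return [
        [alpha * s + (1.0 - alpha) * c for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(sa, ca)
    ]


if __name__ == "__main__":
    # Two tokens per modality, two feature dimensions each (toy values).
    optical = [[1.0, 0.0], [0.0, 1.0]]
    elevation = [[0.5, 0.5], [0.5, 0.5]]
    print(parallel_sa_ca(optical, elevation))
```

Both branches share the same queries, so the blend keeps one output token per input token of `x`. Applying the same function with the arguments swapped would give the corresponding update for the second modality.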
