Abstract: Accurate semantic segmentation of remote sensing data plays a crucial role in the success of geoscience research and applications. Recently, multimodal fusion-based segmentation models have attracted much attention due to their outstanding performance compared with conventional single-modal techniques. However, most of these models perform their fusion operation using either convolutional neural networks (CNNs) or the vision transformer (Vit), resulting in insufficient local–global contextual modeling and representation capabilities. In this work, a multilevel multimodal fusion scheme called FTransUNet is proposed to provide a robust and effective multimodal fusion backbone for semantic segmentation by integrating both CNN and Vit into one unified fusion framework. First, shallow-level features are extracted and fused through convolutional layers and shallow-level feature fusion (SFF) modules. After that, deep-level features characterizing semantic information and spatial relationships are extracted and fused by a well-designed fusion Vit (FVit). FVit applies adaptively mutually boosted attention (Ada-MBA) layers and self-attention (SA) layers alternately in a three-stage scheme to learn cross-modality representations of high interclass separability and low intraclass variations. Specifically, the proposed Ada-MBA computes SA and cross-attention (CA) in parallel to enhance intra- and cross-modality contextual information simultaneously while steering the attention distribution toward semantic-aware regions. As a result, FTransUNet can fuse shallow-level and deep-level features in a multilevel manner, taking full advantage of CNN and transformer to accurately characterize local details and global semantics, respectively. Extensive experiments confirm the superior performance of the proposed FTransUNet compared with other multimodal fusion approaches on two fine-resolution remote sensing datasets, namely ISPRS Vaihingen and Potsdam.
The source code for this work is available at https://github.com/sstary/SSRS.
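
The idea of computing SA and CA in parallel for one modality can be illustrated with a minimal sketch. This is not the authors' Ada-MBA layer: the abstract does not specify its internals, so the sketch omits learned query/key/value projections and multiple heads, and it replaces the adaptive weighting with a fixed scalar `alpha`. The names `attention`, `parallel_sa_ca`, `tokens_a`, and `tokens_b` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim)])
    return out


def parallel_sa_ca(x_main, x_aux, alpha=0.5):
    """Blend self-attention and cross-attention for one modality.

    Assumption: `alpha` is a fixed scalar standing in for whatever
    adaptive weighting the real layer uses.
    """
    # SA branch: queries, keys, and values all come from the same modality.
    sa = attention(x_main, x_main, x_main)
    # CA branch: queries from this modality, keys/values from the other one.
    ca = attention(x_main, x_aux, x_aux)
    return [[alpha * s + (1.0 - alpha) * c for s, c in zip(sa_row, ca_row)]
            for sa_row, ca_row in zip(sa, ca)]


# Toy token sequences for two modalities (made-up values).
tokens_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens_b = [[0.5, 0.5], [2.0, 0.0]]
fused = parallel_sa_ca(tokens_a, tokens_b, alpha=0.5)
```

The two branches share the same queries, so the fused output keeps one vector per token of the main modality regardless of how many tokens the other modality contributes.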

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation
