Abstract: Accurate semantic segmentation of remote sensing data plays a crucial role in the success of geoscience research and applications. Recently, multimodal fusion-based segmentation models have attracted much attention due to their outstanding performance compared with conventional single-modal techniques. However, most of these models perform their fusion operation using convolutional neural networks (CNNs) or the vision transformer (Vit), resulting in insufficient local–global contextual modeling and representational capabilities. In this work, a multilevel multimodal fusion scheme called FTransUNet is proposed to provide a robust and effective multimodal fusion backbone for semantic segmentation by integrating both CNN and Vit into one unified fusion framework. First, shallow-level features are extracted and fused through convolutional layers and shallow-level feature fusion (SFF) modules. After that, deep-level features characterizing semantic information and spatial relationships are extracted and fused by a well-designed fusion Vit (FVit). FVit applies adaptively mutually boosted attention (Ada-MBA) layers and self-attention (SA) layers alternately in a three-stage scheme to learn cross-modality representations of high interclass separability and low intraclass variation. Specifically, the proposed Ada-MBA computes SA and cross-attention (CA) in parallel to enhance intra- and cross-modality contextual information simultaneously while steering the attention distribution toward semantic-aware regions. As a result, FTransUNet can fuse shallow-level and deep-level features in a multilevel manner, taking full advantage of the CNN and the transformer to accurately characterize local details and global semantics, respectively. Extensive experiments confirm the superior performance of the proposed FTransUNet compared with other multimodal fusion approaches on two fine-resolution remote sensing datasets, namely ISPRS Vaihingen and Potsdam.
The source code for this work is available at https://github.com/sstary/SSRS.
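
The idea of computing SA and CA in parallel, as described for Ada-MBA in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that). It assumes single-head scaled dot-product attention with no learned projections, and it uses a fixed scalar `alpha` as a hypothetical stand-in for the adaptive weighting. The names `attention`, `parallel_sa_ca`, `x_a`, `x_b`, and `alpha` are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def parallel_sa_ca(x_a, x_b, alpha=0.5):
    """Hypothetical fusion step: SA and CA computed side by side, then mixed.

    x_a, x_b: token lists for two modalities (same feature dimension).
    alpha:    assumed fixed mixing weight; the paper's adaptive scheme
              is not reproduced here.
    """
    sa = attention(x_a, x_a, x_a)  # intra-modality context
    ca = attention(x_a, x_b, x_b)  # cross-modality context (queries from x_a)
    return [[alpha * s + (1.0 - alpha) * c for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(sa, ca)]


# Two tokens from modality A, three tokens from modality B, feature size 2.
x_a = [[1.0, 0.0], [0.0, 1.0]]
x_b = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
fused = parallel_sa_ca(x_a, x_b)
```

The fused output keeps one vector per token of `x_a`, so a block like this could be alternated with plain SA layers, which is the arrangement the abstract describes at a high level.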

DefFusion: Deformable Multimodal Representation Fusion for 3D Semantic Segmentation

CLFusion: 3D Semantic Segmentation Based on Camera and Lidar Fusion

Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

Transformer-Based Cross-Modal Information Fusion Network for Semantic Segmentation

Multi-Sem Fusion: Multimodal Semantic Fusion for 3-D Object Detection

DLFusion: Painting-Depth Augmenting-LiDAR for Multimodal Fusion 3D Object Detection

Robust 3D Semantic Segmentation Method Based on Multi-Modal Collaborative Learning

Multiview Fusion Driven 3-D Point Cloud Semantic Segmentation Based on Hierarchical Transformer

An RGB-D Fusion Based Semantic Segmentation Algorithm Based on Neighborhood Metric Relations

CMDFusion: Bidirectional Fusion Network with Cross-modality Knowledge Distillation for LIDAR Semantic Segmentation

Towards Deeper and Better Multi-view Feature Fusion for 3D Semantic Segmentation

Transformer Fusion for Indoor RGB-D Semantic Segmentation

Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation

A Multi-phase Camera-LiDAR Fusion Network for 3D Semantic Segmentation with Weak Supervision

EPMF: Efficient Perception-Aware Multi-Sensor Fusion for 3D Semantic Segmentation

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection

MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection

DefDeN: A Deformable Denoising-Based LiDAR and Camera Feature Fusion Model for 3D Object Detection

Robust 3D Semantic Segmentation Based on Multi-Phase Multi-Modal Fusion for Intelligent Vehicles