Abstract:Geometric information in the normalized digital surface models (nDSM) is highly correlated with the semantic class of the land cover. Exploiting two modalities (RGB and nDSM (height)) jointly has great potential to improve the segmentation performance. However, it is still an under-explored field in remote sensing due to the following challenges. First, the scales of existing datasets are relatively small and the diversity of existing datasets is limited, which restricts the ability of validation. Second, there is a lack of unified benchmarks for performance assessment, which leads to difficulties in comparing the effectiveness of different models. Last, sophisticated multi-modal semantic segmentation methods have not been deeply explored for remote sensing data. To cope with these challenges, in this paper, we introduce a new remote-sensing benchmark dataset for multi-modal semantic segmentation based on RGB-Height (RGB-H) data. Towards a fair and comprehensive analysis of existing methods, the proposed benchmark consists of 1) a large-scale dataset including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data. Furthermore, we propose a novel and effective Transformer-based intermediary multi-modal fusion (TIMF) module to improve the semantic segmentation performance through adaptive token-level multi-modal <a class="link-external link-http" href="http://fusion.The" rel="external noopener nofollow">this http URL</a> designed benchmark can foster future research on developing new methods for multi-modal learning on remote sensing data. Extensive analyses of those methods are conducted and valuable insights are provided through the experimental results. Code for the benchmark and baselines can be accessed at \url{<a class="link-external link-https" href="https://github.com/EarthNets/RSI-MMSegmentation" rel="external noopener nofollow">this https URL</a>}.

A Mamba-Diffusion Framework for Multimodal Remote Sensing Image Semantic Segmentation

P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation

PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery

Remote Sensing Image Segmentation Using Vision Mamba and Multi-Scale Multi-Frequency Feature Fusion

Multi-source collaborative enhanced for remote sensing images semantic segmentation

Robust 3D Semantic Segmentation Method Based on Multi-Modal Collaborative Learning

RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation

FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model

Link Aggregation for Skip Connection–Mamba: Remote Sensing Image Segmentation Network Based on Link Aggregation Mamba

RS3Mamba: Visual State Space Model for Remote Sensing Images Semantic Segmentation

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing

UNetMamba: An Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images

Semantic Segmentation of Remote-Sensing Images Based on Multiscale Feature Fusion and Attention Refinement

A ViT-Based Multiscale Feature Fusion Approach for Remote Sensing Image Segmentation

MSFMamba: Multi-Scale Feature Fusion State Space Model for Multi-Source Remote Sensing Image Classification

Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

A Multisensor Data Fusion Model for Semantic Segmentation in Aerial Images

PPMamba: A Pyramid Pooling Local Auxiliary SSM-Based Model for Remote Sensing Image Semantic Segmentation

GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data

Elevation Information-Guided Multimodal Fusion Robust Framework for Remote Sensing Image Segmentation