Abstract: Geometric information in normalized digital surface models (nDSM) is highly correlated with the semantic class of the land cover. Jointly exploiting the two modalities, RGB and nDSM (height), has great potential to improve segmentation performance. However, this remains an under-explored field in remote sensing because of the following challenges. First, existing datasets are relatively small and limited in diversity, which restricts how well methods can be validated. Second, there is a lack of unified benchmarks for performance assessment, which makes it difficult to compare the effectiveness of different models. Last, sophisticated multi-modal semantic segmentation methods have not been deeply explored for remote sensing data. To cope with these challenges, we introduce a new remote-sensing benchmark dataset for multi-modal semantic segmentation based on RGB-Height (RGB-H) data. Towards a fair and comprehensive analysis of existing methods, the proposed benchmark consists of 1) a large-scale dataset including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data. Furthermore, we propose a novel and effective Transformer-based intermediary multi-modal fusion (TIMF) module to improve the semantic segmentation performance through adaptive token-level multi-modal fusion. The designed benchmark can foster future research on developing new methods for multi-modal learning on remote sensing data. Extensive analyses of those methods are conducted and valuable insights are provided through the experimental results.
Code for the benchmark and baselines can be accessed at https://github.com/EarthNets/RSI-MMSegmentation.

SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding

SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

Satellite Video Multi-Label Scene Classification With Spatial and Temporal Feature Cooperative Encoding: A Benchmark Dataset and Method

RSGPT: A Remote Sensing Vision Language Model and Benchmark

OmniSat: Self-Supervised Modality Fusion for Earth Observation

BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping

SAT-MTB-SOS: A Benchmark Dataset for Satellite Video Single Object Segmentation

A Multitask Benchmark Dataset for Satellite Video: Object Detection, Tracking, and Segmentation

Csrs-Siat: A Benchmark Remote Sensing Dataset to Semantic-Enabled and Cross-Scales Scene Recognition

SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

SAT: Spatial Aptitude Training for Multimodal Language Models

Deep Learning for Understanding Satellite Imagery: An Experimental Survey

Benchmarking Deep Learning Frameworks for the Classification of Very High Resolution Satellite Multispectral Data

RSI-CB: A Large Scale Remote Sensing Image Classification Benchmark via Crowdsource Data

GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data

Few-shot satellite image classification for bringing deep learning on board OPS-SAT

Revisiting pre-trained remote sensing model benchmarks: resizing and normalization matters

SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing