Deep Multimodal Fusion for Semantic Segmentation of Remote Sensing Earth Observation Data

Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski

2024-10-01

Abstract: Accurate semantic segmentation of remote sensing imagery is critical for various Earth observation applications, such as land cover mapping, urban planning, and environmental monitoring. However, individual data sources often present limitations for this task. Very High Resolution (VHR) aerial imagery provides rich spatial details but cannot capture temporal information about land cover changes. Conversely, Satellite Image Time Series (SITS) capture temporal dynamics, such as seasonal variations in vegetation, but with limited spatial resolution, making it difficult to distinguish fine-scale objects. This paper proposes a late fusion deep learning model (LF-DLM) for semantic segmentation that leverages the complementary strengths of both VHR aerial imagery and SITS. The proposed model consists of two independent deep learning branches. One branch integrates detailed textures from aerial imagery captured by UNetFormer with a Multi-Axis Vision Transformer (MaxViT) backbone. The other branch captures complex spatio-temporal dynamics from the Sentinel-2 satellite image time series using a U-Net with Temporal Attention Encoder (U-TAE). This approach leads to state-of-the-art results on the FLAIR dataset, a large-scale benchmark for land cover segmentation using multi-source optical imagery. The findings highlight the importance of multi-modality fusion in improving the accuracy and robustness of semantic segmentation in remote sensing applications.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of accurate semantic segmentation of remote sensing imagery. Specifically, it proposes a Late Fusion Deep Learning Model (LF-DLM) that combines the advantages of Very High Resolution (VHR) aerial imagery and Satellite Image Time Series (SITS) to improve segmentation accuracy.

The main objectives include:

1. **Overcoming the limitations of a single data source**: VHR aerial imagery provides rich spatial detail but lacks temporal information, whereas SITS captures temporal dynamics (such as seasonal vegetation changes) at a spatial resolution too low to distinguish small objects on the ground.
2. **Improving segmentation accuracy and robustness**: By fusing information from both data sources, LF-DLM aims for higher segmentation accuracy across different land cover types while maintaining efficient inference speed.
3. **Achieving state-of-the-art performance on multi-source optical imagery**: LF-DLM surpasses previous benchmark methods on the FLAIR dataset.

The main contributions of the paper include:

- Proposing a Late Fusion Deep Learning Model that leverages the complementary advantages of VHR aerial imagery and SITS data to enhance semantic segmentation of remote sensing images.
- Showing experimentally that LF-DLM effectively combines spatial and temporal information, performing well across various land cover types while maintaining efficient inference time.
- Achieving state-of-the-art performance on the FLAIR dataset, surpassing previous methods.
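To make the late-fusion idea concrete, here is a minimal sketch of how per-class scores from two independent branches could be combined at the output stage. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which fusion operator LF-DLM uses, so the weighted sum, the nearest-neighbour upsampling, the weight `w`, and the function names `upsample_nearest` and `late_fuse` are all hypothetical. The hand-written score grids stand in for the outputs of the aerial branch (UNetFormer with MaxViT) and the SITS branch (U-TAE).

```python
from typing import List

# A score grid is indexed as [row][col][class].
Grid = List[List[List[float]]]


def upsample_nearest(scores: Grid, factor: int) -> Grid:
    """Repeat each coarse cell `factor` times along both spatial axes."""
    out: Grid = []
    for row in scores:
        wide = [cell for cell in row for _ in range(factor)]
        out.extend([wide] * factor)  # rows are only read, never mutated
    return out


def late_fuse(aerial: Grid, sits: Grid, factor: int, w: float = 0.5) -> List[List[int]]:
    """Fuse two branches' class scores and return a label map.

    Assumption: fusion is a weighted sum of scores (w for the aerial branch,
    1 - w for the SITS branch) after bringing SITS to the aerial grid.
    """
    sits_up = upsample_nearest(sits, factor)
    labels = []
    for a_row, s_row in zip(aerial, sits_up):
        label_row = []
        for a, s in zip(a_row, s_row):
            fused = [w * x + (1 - w) * y for x, y in zip(a, s)]
            # argmax over classes
            label_row.append(max(range(len(fused)), key=fused.__getitem__))
        labels.append(label_row)
    return labels


# Toy example: a 2x2 aerial grid and a 1x1 SITS grid, two classes.
aerial_scores = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.55, 0.45], [0.2, 0.8]],
]
sits_scores = [[[0.2, 0.8]]]  # one coarse cell covering the whole tile

fused_labels = late_fuse(aerial_scores, sits_scores, factor=2)
print(fused_labels)
```

In this toy case the aerial branch alone would label the lower-left pixel as class 0 (0.55 vs 0.45), while the coarse temporal evidence tips the fused decision to class 1. That is the kind of complementary correction late fusion is meant to allow, with each branch still trained and run independently.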


Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

A Dual-Branch Deep Learning Architecture for Multisensor and Multitemporal Remote Sensing Semantic Segmentation

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

A Multisensor Data Fusion Model for Semantic Segmentation in Aerial Images

U-Net Ensemble for Enhanced Semantic Segmentation in Remote Sensing Imagery

An Attention-Fused Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

MFVNet: a deep adaptive fusion network with multiple field-of-views for remote sensing image semantic segmentation

Robust Semantic Segmentation By Dense Fusion Network On Blurred VHR Remote Sensing Images

A Multi-Step Fusion Network for Semantic Segmentation of High-Resolution Aerial Images

General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation

Deep learning for semantic segmentation of remote sensing images with rich spectral content

Elevation Information-Guided Multimodal Fusion Robust Framework for Remote Sensing Image Segmentation

Deep Feature Selection-And-Fusion for RGB-D Semantic Segmentation

Semantic Segmentation of Very-High-Resolution Remote Sensing Images via Deep Multi-Feature Learning

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation

Efficient Deep Semantic Segmentation for Land Cover Classification Using Sentinel Imagery

Efficient Depth Fusion Transformer for Aerial Image Semantic Segmentation

Beyond RGB: Very High Resolution Urban Remote Sensing With Multimodal Deep Networks

Enhanced semantic-positional feature fusion network via diverse pre-trained encoders for remote sensing image water-body segmentation