Deep Multimodal Fusion for Semantic Segmentation of Remote Sensing Earth Observation Data

Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski
2024-10-01
Abstract: Accurate semantic segmentation of remote sensing imagery is critical for various Earth observation applications, such as land cover mapping, urban planning, and environmental monitoring. However, individual data sources often present limitations for this task. Very High Resolution (VHR) aerial imagery provides rich spatial details but cannot capture temporal information about land cover changes. Conversely, Satellite Image Time Series (SITS) capture temporal dynamics, such as seasonal variations in vegetation, but with limited spatial resolution, making it difficult to distinguish fine-scale objects. This paper proposes a late fusion deep learning model (LF-DLM) for semantic segmentation that leverages the complementary strengths of both VHR aerial imagery and SITS. The proposed model consists of two independent deep learning branches. One branch integrates detailed textures from aerial imagery captured by UNetFormer with a Multi-Axis Vision Transformer (MaxViT) backbone. The other branch captures complex spatio-temporal dynamics from the Sentinel-2 satellite image time series using a U-Net with Temporal Attention Encoder (U-TAE). This approach leads to state-of-the-art results on the FLAIR dataset, a large-scale benchmark for land cover segmentation using multi-source optical imagery. The findings highlight the importance of multi-modality fusion in improving the accuracy and robustness of semantic segmentation in remote sensing applications.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of accurate semantic segmentation of remote sensing imagery. It proposes a Late Fusion Deep Learning Model (LF-DLM) that combines the strengths of Very High Resolution (VHR) aerial imagery and Satellite Image Time Series (SITS).

The main objectives include:

1. **Overcoming the limitations of a single data source**: VHR aerial imagery provides rich spatial detail but lacks temporal information, whereas SITS captures temporal dynamics (such as seasonal vegetation changes) at a spatial resolution too low to distinguish small objects on the ground.
2. **Improving segmentation accuracy and robustness**: By fusing information from both data sources, LF-DLM aims for higher segmentation accuracy across different land cover types while maintaining efficient inference speed.
3. **Achieving state-of-the-art performance on multi-source optical imagery**: LF-DLM achieves state-of-the-art results on the FLAIR dataset, surpassing previous methods.

The main contributions of the paper include:

- Proposing a Late Fusion Deep Learning Model that leverages the complementary advantages of VHR aerial imagery and SITS data to enhance semantic segmentation of remote sensing images.
- Showing experimentally that LF-DLM effectively combines spatial and temporal information, performing well across various land cover types while maintaining efficient inference time.
- Achieving state-of-the-art performance on the FLAIR dataset, surpassing previous methods.
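
To make the late-fusion idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the text above does not say how the two branch outputs are combined, so the weighted sum of per-class scores, the nearest-neighbour upsampling of the coarse SITS scores, and the names `upsample_nearest`, `late_fusion`, and `sits_weight` are all assumptions made for illustration. The two inputs stand in for the per-class score maps that an aerial branch and a time-series branch might produce.

```python
def upsample_nearest(score_map, factor):
    """Nearest-neighbour upsampling of a 2-D grid given as a list of rows."""
    return [
        [value for value in row for _ in range(factor)]
        for row in score_map
        for _ in range(factor)
    ]


def late_fusion(aerial_scores, sits_scores, sits_weight=0.5):
    """Fuse per-class score maps from two branches into one label map.

    aerial_scores: [class][row][col] at full resolution (H x W).
    sits_scores:   [class][row][col] at coarse resolution (h x w),
                   where H and W are the same integer multiple of h and w.
    sits_weight:   hypothetical weight given to the coarse branch.
    """
    height = len(aerial_scores[0])
    width = len(aerial_scores[0][0])
    coarse_height = len(sits_scores[0])
    if len(aerial_scores) != len(sits_scores) or height % coarse_height:
        raise ValueError("class counts or spatial sizes are incompatible")

    # Bring the coarse score maps onto the fine grid.
    factor = height // coarse_height
    sits_up = [upsample_nearest(class_map, factor) for class_map in sits_scores]
    if len(sits_up[0][0]) != width:
        raise ValueError("width is not the same multiple as height")

    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            # Assumed fusion rule: weighted sum of the two branches' scores.
            fused = [
                a[r][c] + sits_weight * s[r][c]
                for a, s in zip(aerial_scores, sits_up)
            ]
            # Predicted label is the class with the highest fused score.
            row.append(max(range(len(fused)), key=fused.__getitem__))
        labels.append(row)
    return labels


# Two classes, a 2x2 aerial tile, and a single coarse SITS pixel.
aerial = [
    [[0.9, 0.2], [0.4, 0.6]],  # class 0 scores
    [[0.1, 0.8], [0.6, 0.4]],  # class 1 scores
]
sits = [[[0.0]], [[1.0]]]      # the coarse branch favours class 1

labels = late_fusion(aerial, sits)
print(labels)  # [[0, 1], [1, 1]]
```

With `sits_weight=0.0` the result is the aerial-only prediction `[[0, 1], [1, 0]]`. In this toy case the coarse temporal evidence flips the bottom-right pixel to class 1 while leaving the confident top-left pixel unchanged, which is the kind of complementary behaviour the fusion is meant to exploit.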