Abstract: In recent years, analysis of remote sensing data has benefited immensely from borrowing techniques from the broader field of computer vision, such as the use of shared models pre-trained on large and diverse datasets. However, satellite imagery has unique features that are not accounted for in traditional computer vision, such as the existence of multiple revisits of the same location. Here, we explore the best way to use revisits in the framework of fine-tuning pre-trained remote sensing models. We focus on an applied research question of relevance to climate change mitigation -- power substation segmentation -- that is representative of applied uses of pre-trained models more generally. Through extensive tests of different multi-temporal input schemes across diverse model architectures, we find that fusing representations from multiple revisits in the model latent space is superior to other methods of using revisits, including as a form of data augmentation. We also find that a SWIN Transformer-based architecture performs better than U-nets and ViT-based models. We verify the generality of our results on a separate building density estimation task.

What problem does this paper attempt to address?

This paper attempts to solve the problem of how to most effectively utilize the multiple revisit data of satellite images in the pre-trained remote sensing model framework to improve the performance of segmentation tasks. Specifically, the author focuses on applied research problems related to climate change mitigation, such as the segmentation of power substations, which represents a broader application scenario of pre-trained models.

### Main problems and objectives of the paper

1. **Utilizing multi-temporal data**: A unique feature of satellite images is that they can be taken multiple times (i.e., multiple revisits) of the same location at different time points. Traditional computer vision methods do not take this feature into account. Therefore, the paper aims to explore how to most effectively utilize these multiple revisit data during the process of fine-tuning the pre-trained remote sensing model.
2. **Improving segmentation performance**: By experimentally comparing different multi-temporal input strategies, the paper hopes to find methods that can significantly improve segmentation performance. In particular, the author focuses on how to fuse information from multiple revisits in the latent space of the model.
3. **Verifying the universality of the method**: To ensure that the proposed method is not only applicable to the power substation segmentation task but also universal, the author verifies it on an independent building density estimation task.

### Main contributions

- **Optimal strategy**: The paper finds that fusing the representations of multiple revisits in the latent space of the model (latent space fusion) is the most effective method of using revisit data, superior to other methods such as data augmentation.
- **Model architecture selection**: The SWIN Transformer architecture outperforms the U-Net and ViT-based models, especially when dealing with multi-temporal inputs.
- **Universality verification**: Through experiments on another building density estimation task, the effectiveness and universality of the proposed method are verified.

### Experimental setup

- **Datasets**:
  - **Power substation dataset**: Collected by TransitionZero, it contains Sentinel-2 images of more than 27,000 locations, with 4-5 revisit images for each location.
  - **PhilEO building density estimation dataset**: Contains global Sentinel-2 images, with at least 3 revisit images for each location.
- **Model architectures**:
  - U-Net (ResNet50 backbone)
  - SWIN Transformer
  - ViT-based model
- **Multi-temporal input strategies**:
  - Single-image input
  - Data-augmented single-image input
  - Average single-image input
  - Latent-space-fused multi-image input
  - Output-fused multi-image input

### Conclusions

The paper shows through extensive experiments that fusing the representations of multiple revisits in the model's latent space can significantly improve the performance of segmentation tasks. In addition, the SWIN Transformer architecture performs best in such tasks. These findings provide valuable insights for researchers in the field of remote sensing, especially in terms of how to effectively utilize multiple revisit data.

### Formula presentation

The formulas involved in the paper include standardization and normalization in data preprocessing:

- **Standardization**: \[ \text{z-score} = \frac{\text{input} - \text{mean}}{\text{std}} \]
- **Normalization**: \[ \text{normalized value} = \frac{\text{input} - \min}{\max - \min} \]
- **Scaling by a constant**: \[ \text{new value} = \operatorname{clip}\left(\frac{\text{input}}{\text{constant}}, 0, 1\right) \]

These formulas are used to adjust the pixel values of input images to meet the requirements of different models.
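The three preprocessing formulas can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function names, the sample pixel values, and the scaling constant of 10000 are all assumptions, and the summary does not say whether the statistics are computed per band, per image, or over the whole dataset. Here they are computed over a single hypothetical band.

```python
from statistics import fmean, pstdev


def standardize(pixels, mean, std):
    """z-score: (input - mean) / std."""
    return [(p - mean) / std for p in pixels]


def min_max_normalize(pixels, lo, hi):
    """Normalization: (input - min) / (max - min)."""
    return [(p - lo) / (hi - lo) for p in pixels]


def scale_by_constant(pixels, constant):
    """Scaling by a constant: clip(input / constant, 0, 1)."""
    return [min(max(p / constant, 0.0), 1.0) for p in pixels]


# Hypothetical pixel values for one band of one image (not from the paper).
band = [0.0, 2500.0, 5000.0, 12000.0]

z_scores = standardize(band, fmean(band), pstdev(band))
normalized = min_max_normalize(band, min(band), max(band))
scaled = scale_by_constant(band, 10000.0)  # 10000 is an assumed constant
```

Standardization leaves values unbounded, while the other two map into [0, 1]; scaling by a constant additionally saturates bright outliers, which is why the last sample value is clipped.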
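The multi-temporal input strategies listed in the experimental setup differ mainly in where the revisits are combined. The toy sketch below makes that concrete under loud assumptions: `encode` and `decode` are hypothetical stand-ins for a pre-trained encoder and a segmentation head, images are flat lists of numbers, and the element-wise mean is an assumed fusion operator (the summary does not state which operator the paper uses).

```python
from statistics import fmean


def encode(image):
    """Hypothetical stand-in for a pre-trained encoder (image -> latent)."""
    return [p * p for p in image]


def decode(latent):
    """Hypothetical stand-in for a segmentation head (latent -> pixel scores)."""
    return [max(v - 0.4, 0.0) for v in latent]


def mean_over_revisits(items):
    """Element-wise mean across revisits; the mean is an assumed operator."""
    return [fmean(values) for values in zip(*items)]


def single_image(revisits):
    # Use one revisit only and ignore the rest.
    return decode(encode(revisits[0]))


def averaged_input(revisits):
    # Combine revisits in pixel space, then run the model once.
    return decode(encode(mean_over_revisits(revisits)))


def latent_fusion(revisits):
    # Encode each revisit separately, fuse the latents, decode once.
    return decode(mean_over_revisits([encode(r) for r in revisits]))


def output_fusion(revisits):
    # Run the full model per revisit, then fuse the predictions.
    return mean_over_revisits([decode(encode(r)) for r in revisits])


# Two hypothetical revisits of the same two-pixel location.
revisits = [[0.0, 1.0], [1.0, 1.0]]
predictions = {
    "single": single_image(revisits),
    "averaged_input": averaged_input(revisits),
    "latent_fusion": latent_fusion(revisits),
    "output_fusion": output_fusion(revisits),
}
```

Because the stand-in encoder and decoder are nonlinear, the three fusion points give different predictions for the first pixel. The data-augmented strategy is not shown, since it presumably changes which samples are seen during training rather than the forward pass.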

Improving satellite imagery segmentation using multiple Sentinel-2 revisits
