Improving satellite imagery segmentation using multiple Sentinel-2 revisits

Kartik Jindgar, Grace W. Lindsay
2024-10-01
Abstract: In recent years, analysis of remote sensing data has benefited immensely from borrowing techniques from the broader field of computer vision, such as the use of shared models pre-trained on large and diverse datasets. However, satellite imagery has unique features that are not accounted for in traditional computer vision, such as the existence of multiple revisits of the same location. Here, we explore the best way to use revisits in the framework of fine-tuning pre-trained remote sensing models. We focus on an applied research question of relevance to climate change mitigation -- power substation segmentation -- that is representative of applied uses of pre-trained models more generally. Through extensive tests of different multi-temporal input schemes across diverse model architectures, we find that fusing representations from multiple revisits in the model latent space is superior to other methods of using revisits, including as a form of data augmentation. We also find that a SWIN Transformer-based architecture performs better than U-nets and ViT-based models. We verify the generality of our results on a separate building density estimation task.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper asks how best to use multiple satellite revisits of the same location when fine-tuning pre-trained remote sensing models, with the goal of improving performance on segmentation tasks. The authors focus on an applied problem relevant to climate change mitigation, power substation segmentation, which is representative of applied uses of pre-trained models more generally.

### Main problems and objectives of the paper

1. **Utilizing multi-temporal data**: A unique feature of satellite imagery is that the same location is imaged repeatedly at different points in time (multiple revisits). Traditional computer vision methods do not account for this feature. The paper therefore explores how to use these revisits most effectively while fine-tuning a pre-trained remote sensing model.
2. **Improving segmentation performance**: By experimentally comparing different multi-temporal input strategies, the paper looks for the approach that most improves segmentation performance. In particular, the authors examine how to fuse information from multiple revisits in the latent space of the model.
3. **Verifying the generality of the method**: To check that the findings are not specific to power substation segmentation, the authors repeat the comparison on an independent building density estimation task.

### Main contributions

- **Optimal strategy**: The paper finds that fusing the representations of multiple revisits in the latent space of the model (latent space fusion) is the most effective way of using revisit data, superior to other methods such as data augmentation.
- **Model architecture selection**: The SWIN Transformer architecture outperforms the U-Net and ViT-based models, especially when dealing with multi-temporal inputs.
- **Universality verification**: Experiments on a separate building density estimation task confirm that the findings carry over beyond power substation segmentation.

### Experimental setup

- **Datasets**:
  - **Power substation dataset**: Collected by TransitionZero, it contains Sentinel-2 images of more than 27,000 locations, with 4-5 revisit images for each location.
  - **PhilEO building density estimation dataset**: Contains global Sentinel-2 images, with at least 3 revisit images for each location.
- **Model architectures**:
  - U-Net (ResNet50 backbone)
  - SWIN Transformer
  - ViT-based model
- **Multi-temporal input strategies**:
  - Single-image input
  - Data-augmented single-image input
  - Average single-image input
  - Latent-space-fused multi-image input
  - Output-fused multi-image input

### Conclusions

Through extensive experiments, the paper shows that fusing the representations of multiple revisits in the model's latent space improves segmentation performance more than the other ways of using revisits that were tested. In addition, the SWIN Transformer architecture performs best on these tasks. These findings give remote sensing researchers practical guidance on how to use multiple revisit data effectively.

### Formula presentation

The formulas involved in the paper cover standardization and normalization in data preprocessing:

- **Standardization**:
  \[ \text{z-score} = \frac{\text{input} - \text{mean}}{\text{std}} \]
- **Normalization**:
  \[ \text{normalized value} = \frac{\text{input} - \text{min}}{\text{max} - \text{min}} \]
- **Scaling by a constant**:
  \[ \text{new value} = \text{clip}\left(\frac{\text{input}}{\text{constant}}, 0, 1\right) \]

These formulas are used to adjust the pixel values of input images to meet the requirements of different models.
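
The three preprocessing formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code. The function names, the toy random image, the choice to compute per-band statistics from that single image, and the value of `SCALE_CONSTANT` are all assumptions made for the example. The summary does not state which constant or which statistics each model uses.

```python
import numpy as np

# Hypothetical scaling constant, chosen only for illustration.
SCALE_CONSTANT = 10000.0


def standardize(x, mean, std):
    """Z-score: (input - mean) / std."""
    return (x - mean) / std


def min_max_normalize(x, lo, hi):
    """Min-max normalization: (input - min) / (max - min)."""
    return (x - lo) / (hi - lo)


def scale_and_clip(x, constant):
    """Divide by a constant, then clip the result to [0, 1]."""
    return np.clip(x / constant, 0.0, 1.0)


# Toy stand-in for one Sentinel-2 revisit, shape (bands, height, width).
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 12000.0, size=(3, 8, 8))

# Per-band statistics taken from this one image (an assumption; in practice
# they would more likely be computed once over a training set).
axes = (1, 2)
band_mean = image.mean(axis=axes, keepdims=True)
band_std = image.std(axis=axes, keepdims=True)
band_min = image.min(axis=axes, keepdims=True)
band_max = image.max(axis=axes, keepdims=True)

z = standardize(image, band_mean, band_std)
normalized = min_max_normalize(image, band_min, band_max)
scaled = scale_and_clip(image, SCALE_CONSTANT)
```

Keeping the statistics as explicit arguments, rather than computing them inside each function, makes it straightforward to reuse the same values across all revisits of a location so that every revisit is scaled consistently.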