GeoViewMatch: A Multi-Scale Feature-Matching Network for Cross-View Geo-Localization Using Swin-Transformer and Contrastive Learning

Wenhui Zhang, Zhinong Zhong, Hao Chen, Ning Jing
DOI: https://doi.org/10.3390/rs16040678
IF: 5
2024-02-15
Remote Sensing
Abstract: Cross-view geo-localization aims to locate street-view images by matching them against a collection of GPS-tagged remote sensing (RS) images. The task is highly challenging because of the significant viewpoint and appearance differences between street-view and RS images. Although deep learning-based methods dominate cross-view geo-localization, existing models have difficulty extracting comprehensive, meaningful features from both image domains. This limitation prevents them from establishing accurate and robust dependencies between street-view images and the corresponding RS images. To address these issues, this paper proposes a novel and lightweight neural network for cross-view geo-localization. First, to capture more diverse information, we propose a module for extracting multi-scale features from images. Second, we introduce contrastive learning and design a contrastive loss to further enhance robustness in extracting and aligning meaningful multi-scale features. Finally, we conduct comprehensive experiments on two open benchmarks. The experimental results demonstrate the superiority of the proposed method over state-of-the-art methods.
environmental sciences, imaging science & photographic technology, remote sensing, geosciences, multidisciplinary
What problem does this paper attempt to address?
The paper aims to address the key challenge in cross-view geo-localization, which is how to determine the location of a street view image by matching it with a collection of remote sensing (RS) images tagged with global positioning system (GPS) coordinates. This task is highly challenging due to the significant viewpoint and appearance differences between street view images and remote sensing images. The paper proposes a novel lightweight neural network model named GeoViewMatch to tackle the aforementioned problem. Specifically, the main contributions of this method are as follows:

1. **Multi-scale Feature Extraction**: To capture more diverse information, the paper proposes a module to extract multi-scale features from images. This approach helps to establish more accurate and robust dependencies between street view images and their corresponding remote sensing images.
2. **Application of Contrastive Learning**: Contrastive learning is introduced, and a contrastive loss function is designed to further enhance the robustness of extracting and aligning meaningful multi-scale features from different types of images.
3. **Swin-Transformer-based Model**: By leveraging the powerful global modeling capability and self-attention mechanism of Swin-Transformer, the model can effectively handle the viewpoint differences between street view images and remote sensing images and extract multi-scale features from them.
4. **Experimental Validation**: Extensive experiments were conducted on two public benchmark datasets, and the results show that the proposed GeoViewMatch method outperforms existing state-of-the-art methods in terms of accuracy and efficiency.

In summary, this study proposes an effective solution to improve feature representation capability in cross-view geo-localization tasks by combining Swin-Transformer and contrastive learning techniques, thereby achieving more accurate localization.
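The contrastive-learning and matching steps summarized above can be illustrated with a small sketch. The summary does not give the exact form of the paper's loss, so the code below assumes a symmetric InfoNCE loss, a common choice for aligning two image domains; it is not presented as the GeoViewMatch loss itself. The function names (`symmetric_info_nce`, `retrieve`), the `temperature` value, and the use of plain NumPy arrays as stand-ins for network embeddings are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(street, rs, temperature=0.1):
    """Assumed contrastive loss (symmetric InfoNCE), not the paper's exact loss.

    street, rs: (N, D) embeddings where row i of each array is a matching
    street-view / remote-sensing pair. All other rows in the batch act as
    negatives.
    """
    s = l2_normalize(street)
    r = l2_normalize(rs)
    logits = s @ r.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(len(s))
    loss_s2r = -log_softmax(logits)[idx, idx].mean()    # street -> RS direction
    loss_r2s = -log_softmax(logits.T)[idx, idx].mean()  # RS -> street direction
    return 0.5 * (loss_s2r + loss_r2s)


def retrieve(street, rs_gallery, gps):
    """Geo-localize each street-view embedding by nearest-neighbour matching.

    Returns the GPS tag of the most similar RS gallery embedding and its index.
    """
    sims = l2_normalize(street) @ l2_normalize(rs_gallery).T
    best = sims.argmax(axis=1)
    return gps[best], best
```

At inference time, under these assumptions, a query street-view embedding inherits the GPS coordinates of its best-matching RS embedding, which is the retrieval formulation of cross-view geo-localization described in the abstract. Minimizing the loss pulls matching pairs together and pushes non-matching pairs apart, so correctly aligned batches yield a lower loss than misaligned ones.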