Abstract: Cross-view geo-localization (CVGL) involves determining the geographical location of a query image by matching it with a corresponding GPS-tagged reference image. Current state-of-the-art methods predominantly rely on training models with labeled paired images, incurring substantial annotation costs and training burdens. In this study, we investigate the adaptation of frozen models for CVGL without requiring ground-truth pair labels. We observe that training on unlabeled cross-view images presents significant challenges, including establishing relationships within unlabeled data and reconciling view discrepancies between uncertain queries and references. To address these challenges, we propose a self-supervised learning framework to train a learnable adapter for a frozen foundation model (FM). This adapter is designed to map feature distributions from diverse views into a uniform space using unlabeled data exclusively. To establish relationships within unlabeled data, we introduce an expectation-maximization (EM)-based pseudo-labeling module, which alternates between estimating matches between cross-view features and optimizing the adapter. To maintain the robustness of the FM's representation, we incorporate an information consistency module with a reconstruction loss, ensuring that adapted features retain strong discriminative ability across views. Experimental results demonstrate that our proposed method achieves significant improvements over vanilla FMs and competitive accuracy compared to supervised methods, while requiring fewer training parameters and relying solely on unlabeled data. Evaluation of our adaptation for task-specific models further highlights its broad applicability. In particular, on the University-1652 dataset, our method outperforms the FM baseline by a substantial margin, improving Recall@1 by about 39 points and average precision (AP) by more than 34 points.
The project is available at https://collebt.github.io/EM-CVGL.
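
The alternation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a linear adapter `W` applied only to query features, soft pseudo-labels from a temperature-scaled softmax over cosine similarities, and a simple penalty that keeps adapted features close to the frozen ones as a stand-in for the reconstruction loss. The names `e_step`, `m_step`, `objective`, and `fit_adapter` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def e_step(adapted_q, ref, temperature=0.1):
    """E-step (assumed form): soft pseudo-labels, one distribution over references per query."""
    logits = l2_normalize(adapted_q) @ l2_normalize(ref).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def objective(W, q, ref, probs, recon_weight=1.0):
    """Alignment to pseudo-label targets plus a reconstruction-style penalty (assumption)."""
    adapted = q @ W
    targets = probs @ ref                      # expected reference feature per query
    align = np.sum((adapted - targets) ** 2)
    recon = np.sum((adapted - q) ** 2)         # stay close to the frozen features
    return (align + recon_weight * recon) / (2 * len(q))

def m_step(W, q, ref, probs, lr=0.5, recon_weight=1.0):
    """M-step (assumed form): one gradient step on `objective` with pseudo-labels fixed."""
    adapted = q @ W
    targets = probs @ ref
    grad = q.T @ ((adapted - targets) + recon_weight * (adapted - q)) / len(q)
    return W - lr * grad

def fit_adapter(query_feats, ref_feats, n_iters=20, temperature=0.1,
                lr=0.5, recon_weight=1.0):
    """Alternate E- and M-steps on frozen, unlabeled features from two views."""
    q = l2_normalize(query_feats)
    ref = l2_normalize(ref_feats)
    W = np.eye(q.shape[1])                     # start from the identity adapter
    for _ in range(n_iters):
        probs = e_step(q @ W, ref, temperature)
        W = m_step(W, q, ref, probs, lr, recon_weight)
    return W, e_step(q @ W, ref, temperature)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref_feats = rng.normal(size=(16, 8))                        # stand-in for frozen FM features
    query_feats = ref_feats + 0.1 * rng.normal(size=(16, 8))    # second view, perturbed
    W, probs = fit_adapter(query_feats, ref_feats)
    print(W.shape, probs.shape)
```

Only `W` is updated here while the feature extractor stays fixed, which mirrors the frozen-FM setting in spirit; a real adapter would likely be nonlinear and trained with mini-batches.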

Fusing Geometric and Scene Information for Cross-View Geo-Localization

Multibranch Joint Representation Learning Based on Information Fusion Strategy for Cross-View Geo-Localization

Ground–Satellite Coupling for Cross-View Geolocation Combined With Multiscale Fusion of Spatial Features

Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator

ConGeo: Robust Cross-view Geo-localization across Ground View Variations

UAV-Satellite View Synthesis for Cross-view Geo-Localization

Direction-Guided Multiscale Feature Fusion Network for Geo-Localization

IML-Net: A Framework for Cross-View Geo-Localization with Multi-Domain Remote Sensing Data

4SCIG: A Four-branch Framework to Reduce the Interference of Sky Area in Cross-view Image Geo-localization

Beyond Geo-localization: Fine-grained Orientation of Street-view Images by Cross-view Matching with Satellite Imagery with Supplementary Materials

Crossview Mapping with Graph-based Geolocalization on City-Scale Street Maps

Geo-Localization via Ground-to-Satellite Cross-View Image Retrieval

Learning Cross-View Visual Geo-Localization Without Ground Truth

Cross-view Geo-localization via Learning Disentangled Geometric Layout Correspondence

Revisiting Street-to-Aerial View Image Geo-localization and Orientation Estimation

From Satellite to Ground: Satellite Assisted Visual Localization with Cross-view Semantic Matching

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization

TransFG: A Cross-View Geo-Localization of Satellite and UAVs Imagery Pipeline Using Transformer-Based Feature Aggregation and Gradient Guidance

A Satellite-Drone Image Cross-View Geolocalization Method Based on Multi-Scale Information and Dual-Channel Attention Mechanism

Simple, Effective and General: A New Backbone for Cross-view Image Geo-localization
