Abstract: Given a ground-level query image and a geo-referenced aerial image that covers the query's local surroundings, fine-grained cross-view localization aims to estimate the location of the ground camera inside the aerial image. Recent works have focused on developing advanced networks trained with accurate ground truth (GT) locations of ground images. However, the trained models always suffer a performance drop when applied to images in a new target area that differs from training. In most deployment scenarios, acquiring fine GT, i.e., accurate GT locations, for target-area images to re-train the network can be expensive and sometimes infeasible. In contrast, collecting images with noisy GT with errors of tens of meters is often easy. Motivated by this, our paper focuses on improving the performance of a trained model in a new target area by leveraging only the target-area images without fine GT. We propose a weakly supervised learning approach based on knowledge self-distillation. This approach uses predictions from a pre-trained model as pseudo GT to supervise a copy of itself. Our approach includes a mode-based pseudo GT generation for reducing uncertainty in pseudo GT and an outlier filtering method to remove unreliable pseudo GT. Our approach is validated using two recent state-of-the-art models on two benchmarks. The results demonstrate that it consistently and considerably boosts the localization accuracy in the target area.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the performance degradation of fine-grained cross-view localization in new target areas. Specifically, when a trained model is applied to a target area that differs from its training area, its localization performance drops, and fine ground truth (accurate ground-image locations) is usually unavailable for re-training. Obtaining fine ground truth in a new target area is expensive and sometimes infeasible, whereas collecting images with noisy ground truth (errors of tens of meters) is relatively easy.

To solve this problem, the paper proposes a weakly supervised learning method based on knowledge self-distillation, which uses ground-to-aerial image pairs in the target area (without fine ground truth) to improve the localization performance of the pre-trained model. The method supervises the student model with pseudo ground truth generated by the teacher, and introduces mode-based pseudo ground truth generation and outlier filtering to reduce uncertainty and noise in the pseudo ground truth.

### Main Contributions

1. **Knowledge Self-Distillation Weakly Supervised Learning Method**: This method improves the model's localization performance in new areas using only ground-to-aerial image pairs in the target area, without relying on fine ground truth.
2. **Reducing Uncertainty**: For methods with coarse-to-fine outputs, the paper explores how to reduce uncertainty and noise in the teacher model's predictions, finding that unimodal pseudo ground truth is more effective than multimodal heatmaps.
3. **Outlier Filtering**: A simple yet effective method is designed to filter outliers in the pseudo ground truth, further improving the localization accuracy of the student model.

### Method Overview

1. **Task Definition**: Given a ground-level image and a geo-referenced aerial image covering its surroundings, the goal is to determine the location of the ground camera in the aerial image.
2. **UDA Method**: Considering the high uncertainty of cross-region samples, the paper adopts knowledge self-distillation: the teacher model trained in the source area generates pseudo ground truth that supervises the training of the student model in the target area.
3. **Pseudo Ground Truth Generation**: Generate "clean" pseudo ground truth that represents only the mode of the teacher's prediction, reducing uncertainty.
4. **Outlier Filtering**: Identify and filter out unreliable samples by comparing the predictions of the teacher model and an auxiliary student model.

### Experimental Results

The paper conducts experiments on the VIGOR and KITTI datasets, showing that the proposed method improves the localization performance of the student model in new target areas and surpasses the baseline of direct generalization. Relative to an Oracle model fine-tuned with fine ground truth, the proposed method obtains its improvement at a lower annotation cost.

### Conclusion

The paper addresses the performance degradation of fine-grained cross-view localization in new target areas. The proposed method has practical value in real-world applications, especially where fine ground truth is difficult to obtain.
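The pseudo ground truth generation and outlier filtering steps in the method overview can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `mode_pseudo_gt` and `keep_mask`, the one-hot encoding of the mode, the pixel-distance test, and the threshold `max_dist_px` are all assumptions chosen for illustration. The summary states only that the pseudo ground truth represents the mode of the teacher's prediction and that unreliable samples are found by comparing the teacher with an auxiliary student.

```python
import numpy as np


def mode_pseudo_gt(heatmap):
    """Collapse a teacher heat map to a one-hot map at its mode (argmax).

    Assumption: the 'clean' pseudo GT is modeled here as a one-hot map.
    Returns the one-hot map and the (row, col) of the mode.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    one_hot = np.zeros_like(heatmap)
    one_hot[row, col] = 1.0
    return one_hot, (int(row), int(col))


def keep_mask(teacher_locs, aux_student_locs, max_dist_px):
    """Mark samples whose teacher and auxiliary-student predictions agree.

    Assumption: agreement is a Euclidean pixel distance no larger than
    max_dist_px, a hypothetical threshold that is not taken from the paper.
    """
    teacher = np.asarray(teacher_locs, dtype=float)
    student = np.asarray(aux_student_locs, dtype=float)
    dist = np.linalg.norm(teacher - student, axis=1)
    return dist <= max_dist_px


# Toy teacher heat map with a dominant peak at (1, 1) and a weaker one at (2, 2).
teacher_heatmap = np.array([
    [0.05, 0.10, 0.05],
    [0.10, 0.40, 0.05],
    [0.05, 0.05, 0.15],
])
pseudo_gt, mode = mode_pseudo_gt(teacher_heatmap)

# Toy predicted locations (row, col) for three target-area samples.
teacher_locs = [(1, 1), (0, 2), (5, 5)]
aux_locs = [(1, 2), (4, 0), (5, 5)]
mask = keep_mask(teacher_locs, aux_locs, max_dist_px=2.0)

print(mode)           # location of the mode in the toy heat map
print(mask.tolist())  # which samples would be kept for student training
```

In this toy setup the second sample is discarded because the two predictions are more than 2 pixels apart; only the retained samples, paired with their one-hot pseudo ground truth, would supervise the student.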

Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth
