Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth

Zimin Xia, Yujiao Shi, Hongdong Li, Julian F. P. Kooij
2024-06-01
Abstract: Given a ground-level query image and a geo-referenced aerial image that covers the query's local surroundings, fine-grained cross-view localization aims to estimate the location of the ground camera inside the aerial image. Recent works have focused on developing advanced networks trained with accurate ground truth (GT) locations of ground images. However, the trained models always suffer a performance drop when applied to images in a new target area that differs from training. In most deployment scenarios, acquiring fine GT, i.e. accurate GT locations, for target-area images to re-train the network can be expensive and sometimes infeasible. In contrast, collecting images with noisy GT with errors of tens of meters is often easy. Motivated by this, our paper focuses on improving the performance of a trained model in a new target area by leveraging only the target-area images without fine GT. We propose a weakly supervised learning approach based on knowledge self-distillation. This approach uses predictions from a pre-trained model as pseudo GT to supervise a copy of itself. Our approach includes a mode-based pseudo GT generation for reducing uncertainty in pseudo GT and an outlier filtering method to remove unreliable pseudo GT. Our approach is validated using two recent state-of-the-art models on two benchmarks. The results demonstrate that it consistently and considerably boosts the localization accuracy in the target area.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses the performance drop of fine-grained cross-view localization models in new target areas. When a trained model is applied to a target area that differs from the training area, its localization accuracy drops. Re-training with fine ground truth, i.e. accurate locations of the target-area ground images, is usually expensive and sometimes infeasible, whereas collecting images with noisy ground truth (errors of tens of meters) is relatively easy.

To solve this problem, the paper proposes a weakly supervised learning method based on knowledge self-distillation, which uses ground-to-aerial image pairs in the target area (without fine ground truth) to improve the localization performance of the pre-trained model. The pre-trained model acts as a teacher whose predictions serve as pseudo ground truth for a student copy of itself. Mode-based pseudo ground truth generation and outlier filtering reduce uncertainty and noise in the pseudo ground truth.

### Main Contributions

1. **Knowledge Self-Distillation Weakly Supervised Learning Method**: This method improves the model's localization performance in new areas by using only ground-to-aerial image pairs in the target area, without relying on fine ground truth.
2. **Reducing Uncertainty**: For methods with coarse-to-fine outputs, the paper explores how to reduce uncertainty and noise in the teacher model's predictions, finding that unimodal pseudo ground truth is more effective than multimodal heatmaps.
3. **Outlier Filtering**: A simple yet effective method is designed to filter outliers in the pseudo ground truth, further improving the localization accuracy of the student model.

### Method Overview

1. **Task Definition**: Given a ground-level image and a geo-referenced aerial image covering its surroundings, the goal is to determine the location of the ground camera in the aerial image.
2. **UDA Method**: Considering the high uncertainty of cross-region samples, the paper adopts knowledge self-distillation: the teacher model trained in the source area generates pseudo ground truth that supervises the training of the student model in the target area.
3. **Pseudo Ground Truth Generation**: Generate "clean" pseudo ground truth that represents only the mode of the teacher's prediction, reducing uncertainty.
4. **Outlier Filtering**: Identify and filter out unreliable samples by comparing the predictions of the teacher model and an auxiliary student model.

### Experimental Results

The paper conducts experiments with two recent state-of-the-art models on the VIGOR and KITTI datasets. The proposed method consistently improves the localization accuracy of the student model in new target areas, surpassing the baseline of applying the source-trained model directly. An Oracle model fine-tuned with fine ground truth in the target area serves as a reference; the proposed method obtains its improvement without that annotation cost.

### Conclusion

The paper addresses the performance drop of fine-grained cross-view localization in new target areas. The proposed method has practical value in real-world applications, especially where obtaining fine ground truth is difficult.
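
The pseudo ground truth generation and outlier filtering steps from the method overview can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each model outputs a 2D heatmap over the aerial image, that the "clean" pseudo ground truth is a one-hot map at the teacher's highest-scoring cell, and that a sample counts as unreliable when the teacher's and auxiliary student's modes lie more than a pixel threshold apart. The function names and the `max_dist` parameter are hypothetical; the summary does not specify the exact filtering rule.

```python
import math
from typing import List, Tuple

Heatmap = List[List[float]]  # assumed format: rows x cols of location scores


def heatmap_mode(heatmap: Heatmap) -> Tuple[int, int]:
    """Return (row, col) of the highest-scoring cell, i.e. the mode."""
    best_pos, best_val = (0, 0), -math.inf
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best_val:
                best_pos, best_val = (r, c), value
    return best_pos


def one_hot_pseudo_gt(teacher_heatmap: Heatmap) -> Heatmap:
    """Unimodal pseudo GT: 1.0 at the teacher's mode, 0.0 elsewhere.

    Assumption: a one-hot map stands in for the 'clean' pseudo GT.
    """
    mode = heatmap_mode(teacher_heatmap)
    return [
        [1.0 if (r, c) == mode else 0.0 for c in range(len(row))]
        for r, row in enumerate(teacher_heatmap)
    ]


def filter_outliers(
    teacher_maps: List[Heatmap],
    aux_student_maps: List[Heatmap],
    max_dist: float,
) -> List[int]:
    """Return indices of samples whose teacher and auxiliary-student modes agree.

    Assumption: agreement means the two modes are at most `max_dist` pixels
    apart; the threshold rule is illustrative, not taken from the paper.
    """
    kept = []
    for idx, (t_map, s_map) in enumerate(zip(teacher_maps, aux_student_maps)):
        t_row, t_col = heatmap_mode(t_map)
        s_row, s_col = heatmap_mode(s_map)
        if math.hypot(t_row - s_row, t_col - s_col) <= max_dist:
            kept.append(idx)
    return kept


# Toy usage: two samples share the same teacher prediction, but the
# auxiliary student agrees with the teacher only on the first one.
teacher = [[0.1, 0.2], [0.6, 0.1]]
student_close = [[0.0, 0.1], [0.8, 0.1]]
student_far = [[0.0, 0.9], [0.05, 0.05]]

pseudo_gt = one_hot_pseudo_gt(teacher)
reliable = filter_outliers([teacher, teacher], [student_close, student_far], max_dist=1.0)
print(pseudo_gt, reliable)
```

In this sketch, only the samples listed in `reliable` would be used, together with their one-hot pseudo ground truth, to supervise the final student model.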