Abstract: Cross-view image matching for geo-localisation is a challenging problem due to the significant visual difference between aerial and ground-level viewpoints. The method provides localisation capabilities from geo-referenced images, eliminating the need for external devices or costly equipment. This enhances the capacity of agents to autonomously determine their position, navigate, and operate effectively in GNSS-denied environments. Current research employs a variety of techniques to reduce the domain gap, such as applying polar transforms to aerial images or synthesising between perspectives. However, these approaches generally rely on having a 360° field of view, limiting real-world feasibility. We propose BEV-CV, an approach introducing two key novelties with a focus on improving the real-world viability of cross-view geo-localisation. Firstly, we bring ground-level images into a semantic Birds-Eye-View before matching embeddings, allowing for direct comparison with aerial image representations. Secondly, we adapt datasets into an application-realistic format - limited Field-of-View images aligned to vehicle direction. BEV-CV achieves state-of-the-art recall accuracies, improving Top-1 rates of 70° crops of CVUSA and CVACT by 23% and 24% respectively. It also decreases computational requirements by reducing floating point operations to below previous works and decreasing embedding dimensionality by 33%, together allowing for faster localisation capabilities.
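The limited Field-of-View format mentioned in the abstract can be illustrated with a minimal sketch. It assumes an equirectangular panorama whose columns map linearly to azimuth, with column 0 at 0° and the heading measured in the same direction as increasing column index; the function names `fov_columns` and `crop_panorama` are hypothetical and are not taken from the paper.

```python
def fov_columns(width, heading_deg, fov_deg):
    """Return the column indices of a crop of fov_deg degrees centred on heading_deg.

    Assumption: columns of the panorama cover 0..360 degrees linearly,
    so the crop wraps around the image seam when needed.
    """
    if not 0 < fov_deg <= 360:
        raise ValueError("fov_deg must be in (0, 360]")
    n_cols = max(1, round(width * fov_deg / 360.0))          # crop width in pixels
    centre = round(width * (heading_deg % 360.0) / 360.0)    # column of the heading
    start = centre - n_cols // 2
    return [(start + i) % width for i in range(n_cols)]      # modulo handles wrap-around


def crop_panorama(pano, heading_deg, fov_deg):
    """Crop a panorama given as a list of rows (each row a list of pixels)."""
    cols = fov_columns(len(pano[0]), heading_deg, fov_deg)
    return [[row[c] for c in cols] for row in pano]


# Usage: a toy 2 x 360 "panorama" whose pixel values equal their column index.
pano = [list(range(360)) for _ in range(2)]
crop = crop_panorama(pano, heading_deg=90, fov_deg=70)
print(len(crop[0]))  # number of columns kept for a 70 degree crop
```

Centring the crop on the heading is one possible reading of "aligned to vehicle direction"; the dataset preparation actually used by the authors may differ.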

What problem does this paper attempt to address?

This paper attempts to address the matching difficulties caused by significant visual differences between ground-view images and aerial images in Cross-View Geo-Localisation (CVGL). Specifically, existing methods generally rely on a 360° field of view to reduce the domain gap between views, which limits their use in practical applications. The paper proposes a new method, BEV-CV (Birds-Eye-View Cross-View), which aims to improve the performance and practical feasibility of CVGL by converting ground-view images into semantic Birds-Eye-View (BEV) maps and matching them under a limited Field-of-View (FOV).

### Main Issues

1. **Viewpoint Differences**: There are significant visual differences between ground-view images and aerial images, leading to matching difficulties.
2. **Limitations of Existing Methods**: Existing methods generally rely on a 360° field of view, limiting their practical application.
3. **Computational Efficiency**: Existing methods demand high computational resources, which is unfavorable for practical applications such as mobile robots.

### Solutions

1. **BEV Conversion**: Convert ground-view images into semantic Birds-Eye-View maps so they can be compared directly with aerial images.
2. **Limited Field-of-View Images**: Use limited Field-of-View images aligned to vehicle direction for matching, which is more in line with practical application scenarios.
3. **Multi-Branch Architecture**: Design a multi-branch architecture to extract features from both views and project them into a shared representation space.
4. **Computational Optimization**: Improve computational efficiency by reducing floating-point operations and lowering the embedding dimension.

### Experimental Results

- **Recall Rate Improvement**: On 70° crops of the CVUSA and CVACT datasets, BEV-CV improved the Top-1 recall rate by 23% and 24%, respectively.
- **Computational Efficiency**: Reduced the number of floating-point operations to below previous works and lowered the embedding dimension by 33%, thereby reducing query time and memory requirements.

### Conclusion

BEV-CV significantly improves the performance and practical feasibility of cross-view geo-localisation by introducing semantic Birds-Eye-View conversion and limited Field-of-View image matching, while also being more efficient in terms of computational resources.
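The Top-1 recall metric reported above can be made concrete with a small sketch of Top-K recall over embedding pairs. This is an illustration under stated assumptions, not the paper's evaluation code: it assumes that ground query `i` has its true match at aerial index `i`, that similarity is cosine similarity, and that all embeddings are non-zero. The helper names `cosine_similarity` and `top_k_recall` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_recall(ground_emb, aerial_emb, k=1):
    """Fraction of ground queries whose true aerial match ranks within the top k.

    Assumption: ground_emb[i] and aerial_emb[i] describe the same location.
    """
    if len(ground_emb) != len(aerial_emb):
        raise ValueError("expected one aerial embedding per ground embedding")
    hits = 0
    for i, query in enumerate(ground_emb):
        sims = [cosine_similarity(query, ref) for ref in aerial_emb]
        # Rank of the true match = number of references scoring strictly higher.
        rank = sum(1 for s in sims if s > sims[i])
        if rank < k:
            hits += 1
    return hits / len(ground_emb)


# Usage with toy 2-D embeddings (real embeddings would be far larger).
ground = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aerial = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
for k in (1, 2, 3):
    print(k, top_k_recall(ground, aerial, k))
```

Because each query is compared against every reference embedding, a lower embedding dimension directly reduces the cost of every similarity computation, which is consistent with the query-time benefit the summary attributes to the 33% reduction.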

BEV-CV: Birds-Eye-View Transform for Cross-View Geo-Localisation
