Offset Regression Enhanced Cross-View Feature Interaction for Ground-to-Aerial Geo-localization

Lei Cheng, Teng Wang, Jiawen Li, Changyin Sun
DOI: https://doi.org/10.1109/tiv.2024.3411098
IF: 8.2
2024-01-01
IEEE Transactions on Intelligent Vehicles
Abstract: Cross-view geo-localization has great potential in the field of autonomous driving for vehicles. Existing methods for ground-to-aerial geo-localization show two main limitations. First, they typically train a two-stream neural network with a metric learning loss to learn feature embeddings for the two views, with each branch processing one view modality independently. Without cross-view interaction, however, the large domain gap makes it harder to find a good feature embedding. It also prevents these models from benefiting from a fine-grained regression loss. Second, the binary labels assigned to cross-view image pairs cannot fully characterize their degree of relevance, thus confounding the learning of these models. To this end, we propose an offset regression enhanced cross-view feature interaction model (OR-CVFI) for this task. OR-CVFI introduces a cross-view attention layer on top of a Siamese-like CNN backbone to maintain interactions between cross-view features. The cross-attentive features are used to measure the similarity between a pair of input images while regressing their position offset. For model training, we design a soft label-guided InfoNCE loss to complement the regular InfoNCE loss. Combined with the proposed two-stage inference strategy, OR-CVFI is shown to significantly outperform state-of-the-art methods in retrieval accuracy. Remarkably, the top-1 recall rate improves from 65.23% to 74.41% and from 33.05% to 45.64% on the same-area and cross-area VIGOR benchmarks, respectively. The code is available at https://doi.org/10.5281/zenodo.11368565
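To make the contrast between binary and soft labels concrete, the following is a minimal sketch and not the authors' implementation. It assumes the soft label-guided term can be written as a cross-entropy between the softmax over pairwise similarities and a normalized soft relevance target; the abstract does not give the exact formulation. The function names, the temperature value, and the toy similarity and label matrices are all hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def info_nce(sim, temperature=0.1):
    """Regular InfoNCE: the positive for row i is column i (hard label)."""
    losses = []
    for i, row in enumerate(sim):
        logp = log_softmax([s / temperature for s in row])
        losses.append(-logp[i])
    return sum(losses) / len(losses)

def soft_info_nce(sim, soft_labels, temperature=0.1):
    """Cross-entropy against soft targets (assumed form, see lead-in).

    soft_labels[i][j] is a hypothetical relevance weight in [0, 1] between
    ground image i and aerial image j; each row is normalized to sum to 1.
    """
    losses = []
    for row, weights in zip(sim, soft_labels):
        logp = log_softmax([s / temperature for s in row])
        total = sum(weights)
        losses.append(-sum(w / total * lp for w, lp in zip(weights, logp)))
    return sum(losses) / len(losses)

# Toy batch: similarities between 3 ground images and 3 aerial images.
sim = [[0.9, 0.6, 0.1],
       [0.2, 0.8, 0.3],
       [0.1, 0.4, 0.7]]
# Hypothetical soft labels: aerial image 1 is partially relevant to ground image 0.
soft = [[1.0, 0.5, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

hard_loss = info_nce(sim)
soft_loss = soft_info_nce(sim, soft)
```

With one-hot targets the soft version reduces to the regular InfoNCE loss, which is why the two terms can be added as complementary objectives. How the paper derives its soft labels and weights the two terms is not stated in the abstract.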