Abstract: Cross-view video geo-localization (CVGL) aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images. Despite their promising performance, current CVGL methods face significant challenges. These methods use camera and odometry data, typically absent in real-world scenarios. They utilize multiple adjacent frames and various encoders for feature extraction, resulting in high computational costs. Moreover, these approaches independently predict each street-view frame's location, resulting in temporally inconsistent GPS trajectories. To address these challenges, in this work, we propose GAReT, a fully transformer-based method for CVGL that does not require camera and odometry data. We introduce GeoAdapter, a transformer-adapter module designed to efficiently aggregate image-level representations and adapt them for video inputs. Specifically, we train a transformer encoder on video frames and aerial images, then freeze the encoder to optimize the GeoAdapter module to obtain video-level representation. To address temporally inconsistent trajectories, we introduce TransRetriever, an encoder-decoder transformer model that predicts GPS locations of street-view frames by encoding top-k nearest neighbor predictions per frame and auto-regressively decoding the best neighbor based on the previous frame's predictions. Our method's effectiveness is validated through extensive experiments, demonstrating state-of-the-art performance on benchmark datasets. Our code is available at https://github.com/manupillai308/GAReT.
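The "top-k nearest neighbor predictions per frame" mentioned in the abstract come from a retrieval step: a street-view embedding is compared against a gallery of aerial-image embeddings and the k best matches are kept. The sketch below illustrates that step only. The two-dimensional embeddings are toy values, the function names are hypothetical, and the use of cosine similarity is an assumption; the abstract does not state which similarity measure GAReT uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_aerial(query_emb, aerial_embs, k):
    """Return indices of the k aerial embeddings most similar to the query.

    Hypothetical helper: cosine similarity is an assumed choice of metric.
    """
    order = sorted(
        range(len(aerial_embs)),
        key=lambda i: cosine(query_emb, aerial_embs[i]),
        reverse=True,
    )
    return order[:k]


# Toy gallery of aerial embeddings and one street-view query (made-up values).
aerial_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_emb = [0.9, 0.1]

ranked = top_k_aerial(query_emb, aerial_embs, k=2)
print(ranked)
```

In a real system the embeddings would come from the trained encoder, and each returned index would map to an aerial tile with a known GPS location.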

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper "GAReT: Cross-View Video Geo-Localization with Adapters and Autoregressive Transformers" addresses several key challenges in cross-view video geo-localization (CVGL). Specifically, existing methods face the following issues:

1. **Dependence on Camera and Odometry Data**: Current CVGL methods typically require camera intrinsic parameters and odometry data to model the temporal relationships in street-view videos and match them against reference aerial images. In real-world scenarios, this data is often unavailable, especially for videos captured in uncontrolled environments.
2. **High Computational Cost**: To improve performance, existing methods use multiple adjacent frames and various encoders for feature extraction. This adds computational overhead to obtaining the features for each frame, making these methods unsuitable for real-time applications.
3. **Temporally Inconsistent GPS Trajectories**: Existing methods predict the location of each street-view video frame independently, which produces temporally inconsistent GPS trajectories. In practical applications, the predictions for consecutive frames should be temporally coherent with the input video.

To address these issues, the authors propose GAReT, a fully transformer-based approach for cross-view video geo-localization. The main innovations of GAReT include:

- **GeoAdapter**: A transformer adapter module designed to efficiently aggregate image-level representations and adapt them to video inputs.
- **TransRetriever**: An encoder-decoder transformer model that encodes the top-k nearest neighbor predictions for each frame and autoregressively decodes the best neighbor based on the previous frame's prediction, yielding temporally consistent GPS trajectories.

Through these innovations, GAReT not only eliminates the need for camera and odometry data but also outperforms existing methods on benchmark datasets.
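The idea behind choosing one of the top-k candidates per frame, conditioned on the previous frame's choice, can be illustrated with a much simpler stand-in. The sketch below is not TransRetriever: it replaces the learned encoder-decoder with a hand-written greedy rule that picks, for each frame, the candidate closest to the previously selected location. The function names and the coordinates are hypothetical and chosen only for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def decode_trajectory(candidates):
    """Greedy stand-in for temporally consistent decoding (an assumption, not the paper's model).

    candidates[t] lists the top-k (lat, lon) guesses for frame t, best retrieval score first.
    The first frame keeps its top-1 guess; each later frame takes the candidate
    nearest to the location selected for the previous frame.
    """
    trajectory = [candidates[0][0]]
    for frame_candidates in candidates[1:]:
        previous = trajectory[-1]
        trajectory.append(min(frame_candidates, key=lambda c: haversine_m(previous, c)))
    return trajectory


# Made-up candidates: the top-1 guess for the second frame is a distant outlier.
candidates = [
    [(47.6000, -122.3300), (47.7000, -122.2000)],
    [(47.9000, -122.0000), (47.6002, -122.3301)],
    [(47.6004, -122.3302), (47.9001, -122.0001)],
]

trajectory = decode_trajectory(candidates)
independent = [frame[0] for frame in candidates]  # per-frame top-1, no temporal context
print(trajectory)
print(independent)
```

Taking the top-1 guess for every frame independently would jump to the outlier at the second frame, whereas the conditioned choice stays on a smooth path. A learned decoder can weigh appearance and context rather than distance alone, which this heuristic does not attempt.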

GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers

Geo-Localization with Transformer-Based 2D-3D Match Network

Co-Visual Pattern-Augmented Generative Transformer Learning for Automobile Geo-Localization

GeoDTR+: Toward generic cross-view geolocalization via geometric disentanglement

TransFG: A Cross-View Geo-Localization of Satellite and UAVs Imagery Pipeline Using Transformer-Based Feature Aggregation and Gradient Guidance

TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization

Learning Cross-View Visual Geo-Localization Without Ground Truth

Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization

Cross-view Geo-localization with Evolving Transformer

BEV-CV: Birds-Eye-View Transform for Cross-View Geo-Localisation

Mutual Relative Position Learning Transformer for Cross-View Geo-Localization

Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer

CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data

GAMa: Cross-view Video Geo-localization

SMDT: Cross-View Geo-Localization with Image Alignment and Transformer

Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network

TransVLAD: Multi-Scale Attention-Based Global Descriptors for Visual Geo-Localization

Cross-view Geo-localization via Learning Disentangled Geometric Layout Correspondence

Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator