Abstract: Determining the exact latitude and longitude at which a photo was taken is a useful and widely applicable task, yet it remains exceptionally difficult despite the accelerated progress of other computer vision tasks. Most previous approaches have opted to learn a single representation of query images, which are then classified at different levels of geographic granularity. These approaches fail to exploit the different visual cues that give context to different hierarchies, such as the country, state, and city level. To this end, we introduce an end-to-end transformer-based architecture that exploits the relationship between different geographic levels (which we refer to as hierarchies) and the corresponding visual scene information in an image through hierarchical cross-attention. We achieve this by learning a query for each geographic hierarchy and scene type. Furthermore, we learn a separate representation for different environmental scenes, as different scenes in the same location are often defined by completely different visual features. We achieve state-of-the-art street-level accuracy on four standard geo-localization datasets: Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k, and we qualitatively demonstrate how our method learns different representations for different visual hierarchies and scenes, which has not been demonstrated by previous methods. These existing test datasets mostly consist of iconic landmarks or images taken from social media, which makes the task either one of memorization or biased towards certain places. To address this issue, we introduce a much harder test dataset, Google-World-Streets-15k, composed of images taken from Google Streetview covering the whole planet, and present state-of-the-art results on it. Our code will be made available in the camera-ready version.
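
The idea of learning one query per geographic hierarchy and scene type, and letting each query cross-attend to the image features, can be illustrated with a minimal sketch. This is not the authors' implementation: the counts `NUM_HIERARCHIES` and `NUM_SCENES`, the dimension `DIM`, the helper names, and the randomly initialised vectors are all assumptions chosen for illustration, and the attention is a single head without the learned projections a real transformer layer would use.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head scaled dot-product cross-attention (no learned projections).

    Each query attends over all image tokens and returns a weighted
    average of them, together with the attention weights.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim)
            for t in tokens
        ]
        w = softmax(scores)
        out = [
            sum(w[j] * tokens[j][d] for j in range(len(tokens)))
            for d in range(dim)
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical sizes, chosen only for this illustration.
NUM_HIERARCHIES = 3  # e.g. country, state, city
NUM_SCENES = 2       # assumed number of scene types
DIM = 8              # assumed feature dimension
NUM_TOKENS = 5       # assumed number of image patch tokens

rng = random.Random(0)


def random_vectors(count, dim):
    """Stand-in for learned parameters or encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# One query per (hierarchy, scene) pair; in a trained model these would be learned.
queries = random_vectors(NUM_HIERARCHIES * NUM_SCENES, DIM)
# Stand-in for the token features produced by an image encoder.
image_tokens = random_vectors(NUM_TOKENS, DIM)

outputs, weights = cross_attend(queries, image_tokens)
```

Each row of `outputs` is a separate image representation for one hierarchy and scene pair, which is the property the abstract contrasts with learning a single representation per image.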

Multi-modal, multi-resource methods for placing Flickr videos on the map

Advanced Techniques for Geospatial Referencing in Online Media Repositories

Embedding Geographic Locations for Modelling the Natural Environment using Flickr Tags and Structured Data

Maps from Motion (MfM): Generating 2D Semantic Maps from Sparse Multi-view Images

A Unified Geolocation Framework for Web Videos

Learning Neighborhood Representation from Multi-Modal Multi-Graph: Image, Text, Mobility Graph and Beyond

An Efficient Approach for Geo-Multimedia Cross-Modal Retrieval

Fusion of Multimodal Embeddings for Ad-Hoc Video Search

Web Video Geolocation by Geotagged Social Resources

Statewide Visual Geolocalization in the Wild

City-Identification of Flickr Videos Using Semantic Acoustic Features

Geographic Mapping with Unsupervised Multi-Modal Representation Learning from VHR Images and POIs

Visualizing and Analyzing Video Content with Interactive Scalable Maps

When Location Meets Social Multimedia: A Survey on Vision-Based Recognition and Mining for Geo-Social Multimedia Analytics

Localizing Web Videos from Heterogeneous Images

Multimodal Information Joint Learning for Geotagged Image Search

Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes

Localizing Events in Videos with Multimodal Queries

Enhancing Micro-Video Venue Recognition via Multi-Modal and Multi-Granularity Object Relations

Annotating and navigating tourist videos