Abstract: The locations of social images play an important role in geographic knowledge discovery. However, most social images still lack location information, which has made location estimation for social images an active research topic. The rapid growth of social images poses two challenges: 1) data quality is an issue, because social images are often accompanied by noisy, error-prone user-generated content such as junk comments and misspelled words; and 2) data sparsity persists despite the large volume, since most images are unevenly distributed around the world and their contextual information is often missing or incomplete. In this paper, we propose a spatial-aware multimodal location estimation (SMLE) framework to tackle these challenges. Specifically, a spatial-aware language model (SLM) is proposed to detect high-quality location-indicative tags from large datasets. We also design a spatial-aware topic model, namely spatial-aware regularized latent semantic indexing (SRLSI), to discover geographic topics and alleviate the data sparseness problem in language modeling. To exploit the multiple modalities of social images, we employ a learning-to-rank approach to fuse multiple pieces of evidence derived from textual features represented by SLM and SRLSI and visual features represented by bag-of-visual-words (BoVW). Importantly, an ad hoc method is introduced to construct the training dataset with spatial-aware relevance labels for learning-to-rank training. Finally, given a query image, its location is estimated as the location of the most relevant image returned by the learning-to-rank model. The proposed framework is evaluated on a public benchmark provided by the MediaEval 2013 Placing Task, which contains more than 8.5 million images crawled from Flickr. Extensive experiments on this dataset demonstrate the superior performance of the proposed methods over state-of-the-art approaches.
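
The last two steps of the pipeline described in the abstract (spatial-aware relevance labels for training, and estimating a query's location from its top-ranked image) can be illustrated with a minimal sketch. This is not the authors' implementation: the distance thresholds, the function names, and the hand-set linear weights standing in for a trained learning-to-rank model are all assumptions made for illustration, and the three feature values per candidate are assumed to be precomputed SLM, SRLSI, and BoVW similarity scores.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def relevance_label(query_loc, cand_loc, thresholds_km=(1.0, 10.0, 100.0)):
    """Hypothetical graded label: the closer the candidate, the higher the label.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    d = haversine_km(query_loc, cand_loc)
    for i, t in enumerate(thresholds_km):
        if d <= t:
            return len(thresholds_km) - i
    return 0


def estimate_location(candidates, weights):
    """Return the location of the highest-scoring candidate image.

    candidates: list of dicts with 'loc' (lat, lon) and 'features', assumed
    here to be (SLM score, SRLSI score, BoVW score) against the query.
    weights: a linear scorer standing in for a trained ranking model.
    """
    def score(c):
        return sum(w * f for w, f in zip(weights, c["features"]))

    return max(candidates, key=score)["loc"]


candidates = [
    {"loc": (48.86, 2.35), "features": (0.9, 0.8, 0.3)},
    {"loc": (51.51, -0.13), "features": (0.2, 0.1, 0.9)},
]
estimated = estimate_location(candidates, weights=(0.5, 0.3, 0.2))
```

In a real system the weights would be replaced by a ranking model trained on query-candidate pairs labeled with something like `relevance_label`, so that geographically close images are pushed to the top of the ranking.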

Spatial Position Reasoning of Image Entities Based on Location Words

Understanding Spatial Relations through Multiple Modalities

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Exploring Entity-Level Spatial Relationships for Image-Text Matching

Spatial-Aware Multimodal Location Estimation For Social Images

Things not Written in Text: Exploring Spatial Commonsense from Visual Signals

Spatial Constraint for Image Location Estimation

Evaluating the Generation of Spatial Relations in Text and Image Generative Models

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Collaborative Position Reasoning Network for Referring Image Segmentation

Acquiring Common Sense Spatial Knowledge Through Implicit Spatial Templates

Holistic Spatial Reasoning for Chinese Spatial Language Understanding

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

Spatially Constrained Location Prior for Scene Parsing

An effective spatial relational reasoning networks for visual question answering

Structured Spatial Reasoning with Open Vocabulary Object Detectors

Spatial Guided Image Captioning: Guiding Attention with Object's Spatial Interaction

Image-Text Embedding Learning Via Visual and Textual Semantic Reasoning