Abstract: Spatial position reasoning effectively simulates the perceptual and comprehension abilities of artificial intelligence, especially in multimodal modeling that fuses images with language. Recent image–language models have made significant advances in multimodal reasoning tasks. Notably, contrastive learning models based on the Contrastive Language-Image Pre-training (CLIP) framework have attracted substantial interest. Current contrastive learning models focus mainly on the nouns and verbs in image descriptions, while spatial locatives receive comparatively little attention. However, spatial prepositions carry the critical positional information between entities in an image, which is essential for the reasoning capabilities of image–language models. This paper introduces a spatial location reasoning model built on spatial locative terms. The model focuses on the spatial prepositions in image descriptions, uses them to model the positional relations between entities in the image, evaluates and verifies those relations, and aligns them with the image–text descriptions. The model is an enhancement of CLIP that examines the semantic characteristics of spatial prepositions and highlights their guiding role in vision-language models. Experiments on open datasets suggest that the proposed model captures the correlation of spatial indicators in both image and text representations. Incorporating spatial position terms into the model raised the average prediction accuracy by approximately three percentage points.
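To make the role of spatial prepositions concrete, the following minimal sketch locates spatial terms in a caption and builds a counterfactual caption by swapping each term for its opposite. Such caption pairs are one plausible way to probe whether a CLIP-style model is sensitive to location words. This is an illustration under stated assumptions, not the paper's implementation: the vocabulary `OPPOSITES` and the helper names `find_spatial_terms` and `swap_spatial_terms` are hypothetical, and the abstract does not specify the model's actual word list or training procedure.

```python
import re

# Hypothetical vocabulary: each spatial term maps to its opposite.
# The paper's actual list of location words is not given in the abstract.
OPPOSITES = {
    "above": "below",
    "below": "above",
    "under": "over",
    "over": "under",
    "behind": "in front of",
    "in front of": "behind",
    "left of": "right of",
    "right of": "left of",
    "inside": "outside",
    "outside": "inside",
}

# Try longer phrases first so multi-word terms such as "in front of"
# are matched as a whole; \b keeps "over" from matching inside "overall".
_PATTERN = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, OPPOSITES), key=len, reverse=True))
    + r")\b"
)


def find_spatial_terms(caption):
    """Return the spatial terms found in the caption, in order of appearance."""
    return _PATTERN.findall(caption.lower())


def swap_spatial_terms(caption):
    """Return a lowercased counterfactual caption with each spatial term inverted."""
    return _PATTERN.sub(lambda m: OPPOSITES[m.group(1)], caption.lower())


if __name__ == "__main__":
    caption = "A cat under the table"
    print(find_spatial_terms(caption))
    print(swap_spatial_terms(caption))
```

In an evaluation of this kind, a model that attends to location words should score the original caption higher than the swapped one for the same image. How the paper itself measures this is not stated in the abstract.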

QR-CLIP: Introducing Explicit Knowledge for Location and Time Reasoning

QR-CLIP: Introducing Explicit Open-World Knowledge for Location and Time Reasoning

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

There is a Time and Place for Reasoning Beyond the Image

Spatial Position Reasoning of Image Entities Based on Location Words

GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

Explicit Knowledge-based Reasoning for Visual Question Answering

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

Quadruple Mention Text-Enhanced Temporal Knowledge Graph Reasoning

Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning

RTQ: Rethinking Video-language Understanding Based on Image-text Model

Quartet Logic: A Four-Step Reasoning (QLFR) framework for advancing Short Text Classification

Explicit Knowledge Incorporation for Visual Reasoning

TimeR4: Time-aware Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering

Visual Mind: Visual Question Answering (VQA) with CLIP Model

A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization

Knowledge-Embedded Mutual Guidance for Visual Reasoning

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

Joint Answering and Explanation for Visual Commonsense Reasoning

Exploiting Intrinsic Multilateral Logical Rules for Weakly Supervised Natural Language Video Localization

EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning