Abstract: Reasoning is one of the central topics in artificial intelligence. As an important reasoning paradigm, entailment recognition, which judges whether a hypothesis can be inferred from given premises, has attracted much research interest. However, existing research mainly focuses on text-based analysis, that is, recognizing textual entailment (RTE), which limits its depth and breadth. In fact, human knowledge and inference span different sensory channels such as language and vision, each offering a unique perspective and complementary reasoning cues. It is therefore important to extend entailment recognition research to cross-media scenarios, that is, recognizing cross-media entailment (RCE). This article focuses on one representative RCE task, visual-textual reasoning, and proposes the visual-textual hybrid sequence matching (VHSM) approach. VHSM reasons from image-text premises to text hypotheses and makes three contributions: 1) Visual-textual hybrid multicontext inference is proposed to address RCE by matching with hybrid context embeddings, with adaptive gated aggregation to obtain the final prediction. It exploits the interaction of complementary visual-textual cues during joint reasoning. 2) Memory attention-based context embedding is proposed to sequentially encode hybrid context embeddings, with memory attention networks that compare neighboring time-steps. This captures the important memory dimensions through coefficient assignment and exploits the visual-textual context correlation. 3) A cross-task and visual-textual transfer strategy is proposed to enrich the correlation training information and boost reasoning accuracy; it transfers knowledge not only from the cross-media retrieval task to RCE but also between corresponding text and image premises. Experimental results for the visual-textual entailment recognition task on the SNLI dataset verify the effectiveness of VHSM.
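As a rough illustration of the adaptive gated aggregation idea named in the abstract, the following minimal sketch fuses per-context predictions with softmax-normalized gates. It is not the VHSM implementation: the function names, the three example contexts, the gate scores, and the choice of a softmax gate are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_aggregate(context_logits, gate_scores):
    """Fuse per-context class logits with softmax-normalized gates.

    context_logits: one list of class logits per context (hypothetical).
    gate_scores: one unnormalized gate score per context (hypothetical;
                 in a trained model these would be learned).
    Returns the gate weights and the fused class probabilities.
    """
    gates = softmax(gate_scores)
    num_classes = len(context_logits[0])
    fused = [
        sum(g * logits[c] for g, logits in zip(gates, context_logits))
        for c in range(num_classes)
    ]
    return gates, softmax(fused)

# Made-up logits over (entailment, neutral, contradiction).
context_logits = [
    [2.0, 0.5, -1.0],  # assumed textual context
    [0.3, 0.2, 0.1],   # assumed visual context
    [1.5, 0.0, -0.5],  # assumed hybrid context
]
gate_scores = [1.0, -1.0, 0.5]

gates, probs = gated_aggregate(context_logits, gate_scores)
labels = ["entailment", "neutral", "contradiction"]
predicted = labels[probs.index(max(probs))]
```

In this sketch a context with a higher gate score contributes more to the fused prediction, which is one plausible reading of "adaptive" aggregation; the gating function actually used by VHSM may differ.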

Visual–Textual Hybrid Sequence Matching for Joint Reasoning

RCE-HIL

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

Visual Semantic Reasoning for Image-Text Matching

KM4: Visual reasoning via Knowledge Embedding Memory Model with Mutual Modulation

EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning

Joint Answering and Explanation for Visual Commonsense Reasoning

Heterogeneous memory enhanced graph reasoning network for cross-modal retrieval

Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning

Multi-view and region reasoning semantic enhancement for image-text retrieval

Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

Visual Contextual Semantic Reasoning for Cross-Modal Drone Image-Text Retrieval

From Recognition to Cognition: Visual Commonsense Reasoning

Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations

Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues

VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning