Abstract: Reasoning is one of the central topics in artificial intelligence. As an important reasoning paradigm, entailment recognition, which judges whether a hypothesis can be inferred from given premises, has attracted much research interest. However, existing research mainly focuses on text-based analysis, that is, recognizing textual entailment (RTE), which limits its depth and breadth. In fact, human knowledge and inference span different sensory channels such as language and vision, each offering a unique perspective and complementary reasoning cues. It is therefore significant to extend existing entailment recognition research to cross-media scenarios, that is, recognizing cross-media entailment (RCE). This article focuses on one representative RCE task, visual-textual reasoning, and proposes the visual-textual hybrid sequence matching (VHSM) approach. VHSM reasons from image-text premises to text hypotheses, and its contributions are as follows: 1) visual-textual hybrid multicontext inference is proposed to address RCE by matching with hybrid context embeddings, followed by adaptive gated aggregation to obtain the final prediction. This exploits the interaction of complementary visual and textual cues during joint reasoning; 2) memory attention-based context embedding is proposed to sequentially encode hybrid context embeddings, using memory attention networks that compare neighboring time steps. By assigning coefficients, this captures the important memory dimensions and exploits the visual-textual context correlation; and 3) a cross-task and visual-textual transfer strategy is proposed to enrich the correlation training information and boost reasoning accuracy, transferring knowledge not only from the cross-media retrieval task to RCE but also between corresponding text and image premises. Experimental results on the visual-textual entailment recognition task on the SNLI dataset verify the effectiveness of VHSM.
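
The adaptive gated aggregation named in contribution 1 can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything below is an assumption for illustration: the function names (`softmax`, `gated_aggregate`), the three contexts, the logits, and the fixed gate scores are all hypothetical. In VHSM the gates would presumably be learned, whereas here they are hand-set numbers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_aggregate(context_logits, gate_scores):
    """Mix per-context class distributions with softmax-normalized gates.

    Hypothetical helper: one row of class logits per context, one gate
    score per context. This is not the authors' formulation.
    """
    gates = softmax(gate_scores)
    probs = [softmax(logits) for logits in context_logits]
    num_classes = len(context_logits[0])
    return [
        sum(g * p[c] for g, p in zip(gates, probs))
        for c in range(num_classes)
    ]

# Made-up logits over (entailment, neutral, contradiction) for three
# assumed contexts: textual, visual, and hybrid.
labels = ["entailment", "neutral", "contradiction"]
context_logits = [
    [2.0, 0.5, -1.0],  # textual context (assumed)
    [0.2, 0.1, 0.0],   # visual context (assumed)
    [1.5, 0.0, -0.5],  # hybrid context (assumed)
]
gate_scores = [1.0, -1.0, 0.5]  # hand-set here; learned in a real model

final = gated_aggregate(context_logits, gate_scores)
prediction = labels[final.index(max(final))]
print(prediction, [round(p, 3) for p in final])
```

Because the gates are normalized to sum to one and each context contributes a probability distribution, the aggregated output is itself a distribution over the three entailment labels.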

Visual Semantic Reasoning for Image-Text Matching

Visual-Semantic Graph Matching for Visual Grounding

Multi-view and region reasoning semantic enhancement for image-text retrieval

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

Visual-Semantic Matching by Exploring High-Order Attention and Distraction

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Image-text Retrieval via Preserving Main Semantics of Vision

Visual–Textual Hybrid Sequence Matching for Joint Reasoning

Context‐aware relation enhancement and similarity reasoning for image‐text retrieval

Scene graph semantic inference for image and text matching

Similarity Reasoning and Filtration for Image-Text Matching

Composing Object Relations and Attributes for Image-Text Matching

Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval

Cross-modal alignment with graph reasoning for image-text retrieval

Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation

Dual Semantic Relationship Attention Network for Image-Text Matching

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Exploring Entity-Level Spatial Relationships for Image-Text Matching

Cross-Modal Image-Text Retrieval with Semantic Consistency

Text-Vision Relationship Alignment for Referring Image Segmentation