Abstract: Reasoning is one of the central topics in artificial intelligence. Entailment recognition, an important reasoning paradigm that judges whether a hypothesis can be inferred from given premises, has attracted much research interest. However, existing research mainly focuses on text-based analysis, that is, recognizing textual entailment (RTE), which limits its depth and breadth. Human knowledge and inference, in contrast, draw on multiple sensory channels such as language and vision, each offering a unique perspective and complementary reasoning cues. It is therefore important to extend entailment recognition research to cross-media scenarios, that is, recognizing cross-media entailment (RCE). This article focuses on one representative RCE task, visual-textual reasoning, and proposes the visual-textual hybrid sequence matching (VHSM) approach. VHSM reasons from image-text premises to text hypotheses, and its contributions are as follows: 1) Visual-textual hybrid multicontext inference is proposed to address RCE by matching with hybrid context embeddings, with adaptive gated aggregation producing the final prediction. It exploits the interaction of complementary visual-textual cues during joint reasoning. 2) Memory attention-based context embedding is proposed to sequentially encode hybrid context embeddings, using memory attention networks to compare neighboring time steps. By assigning coefficients, it captures the important memory dimensions and exploits the visual-textual context correlation. 3) A cross-task and visual-textual transfer strategy is proposed to enrich the correlation training information and boost reasoning accuracy; it transfers knowledge not only from the cross-media retrieval task to RCE but also between corresponding text and image premises. Experimental results on the visual-textual entailment recognition task on the SNLI dataset verify the effectiveness of VHSM.
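
The gated aggregation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the gating function, so everything here is an assumption made for illustration: each context is taken to produce one vector of class scores, the gates are taken to be a softmax over hypothetical scalar gate scores, and the names `gated_aggregate`, `context_logits`, and `gate_scores` are invented for this example. It is not the authors' implementation.

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_aggregate(context_logits, gate_scores):
    """Fuse per-context class scores with softmax-normalized gates.

    context_logits: one list of class scores per context (hypothetical input).
    gate_scores: one scalar per context; in a trained model these would be
    learned, here they are fixed by hand (assumption).
    Returns the gate weights and the fused class probabilities.
    """
    if len(context_logits) != len(gate_scores):
        raise ValueError("need exactly one gate score per context")
    gates = softmax(gate_scores)
    num_classes = len(context_logits[0])
    # Weighted sum of the per-context scores, class by class.
    fused = [
        sum(g * logits[c] for g, logits in zip(gates, context_logits))
        for c in range(num_classes)
    ]
    return gates, softmax(fused)


# Three hypothetical contexts, e.g. text-only, image-only, and hybrid.
context_logits = [
    [2.0, 0.5, -1.0],
    [0.2, 1.5, 0.1],
    [1.0, 1.0, -0.5],
]
gate_scores = [1.0, 0.0, 0.0]  # the first context is trusted most

gates, probs = gated_aggregate(context_logits, gate_scores)
prediction = LABELS[max(range(len(probs)), key=probs.__getitem__)]
print(gates, probs, prediction)
```

In this sketch the gates decide how much each context contributes to the final decision, so a context with a higher gate score pulls the fused prediction toward its own class scores.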

Visual Contextual Semantic Reasoning for Cross-Modal Drone Image-Text Retrieval

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Context‐aware relation enhancement and similarity reasoning for image‐text retrieval

A Multimodal Approach for Cross-Domain Image Retrieval

Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval

Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

Visual context learning based on textual knowledge for image-text retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models

A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension

SIRS: Multitask Joint Learning for Remote Sensing Foreground-Entity Image–Text Retrieval

Cross-modal alignment with graph reasoning for image-text retrieval

A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text

Region-based Cross-modal Retrieval

Geometry Sensitive Cross-Modal Reasoning for Composed Query Based Image Retrieval

Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Visual–Textual Hybrid Sequence Matching for Joint Reasoning

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval