Abstract: Reasoning is one of the central topics in artificial intelligence. Entailment recognition, an important reasoning paradigm that judges whether a hypothesis can be inferred from given premises, has attracted much research interest. However, existing research mainly focuses on text-based analysis, that is, recognizing textual entailment (RTE), which limits its depth and breadth. Human knowledge and inference, by contrast, span multiple sensory channels such as language and vision, each offering a unique perspective and complementary reasoning cues. It is therefore important to extend entailment recognition to cross-media scenarios, that is, recognizing cross-media entailment (RCE). This article focuses on one representative RCE task, visual-textual reasoning, and proposes the visual-textual hybrid sequence matching (VHSM) approach, which reasons from image-text premises to text hypotheses. Its contributions are as follows: 1) Visual-textual hybrid multicontext inference is proposed to address RCE by matching with hybrid context embeddings, with adaptive gated aggregation producing the final prediction. It exploits the interaction of complementary visual-textual cues during joint reasoning. 2) Memory attention-based context embedding is proposed to sequentially encode hybrid context embeddings, using memory attention networks to compare neighboring time steps. It captures the important memory dimensions through coefficient assignment, exploiting the visual-textual context correlation. 3) A cross-task and visual-textual transfer strategy is proposed to enrich the correlation training information and boost reasoning accuracy; it transfers knowledge not only from the cross-media retrieval task to RCE but also between corresponding text and image premises. Experimental results on the visual-textual entailment recognition task on the SNLI dataset verify the effectiveness of VHSM.
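
The abstract names adaptive gated aggregation but does not spell out its form. As a rough illustration only, the sketch below shows one common way such a gate could blend per-context predictions: each context (for example text-only, image-only, and hybrid) yields class logits, and softmax-normalized gate scores weight the resulting class distributions. The function names, the softmax gate, and the idea that gate scores arrive as plain numbers are all assumptions made for this example; in a trained model the gate scores would be learned, and the article's actual formulation may differ.

```python
import math

# SNLI uses these three labels; the ordering here is an assumption.
LABELS = ("entailment", "neutral", "contradiction")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_aggregate(context_logits, gate_scores):
    """Blend per-context class distributions using softmax-normalized gates.

    context_logits: one list of class logits per context (hypothetical input).
    gate_scores: one unnormalized gate score per context (hypothetical input).
    Returns a single probability distribution over LABELS.
    """
    if len(context_logits) != len(gate_scores):
        raise ValueError("need exactly one gate score per context")
    gates = softmax(gate_scores)
    probs = [softmax(logits) for logits in context_logits]
    return [
        sum(g * p[c] for g, p in zip(gates, probs))
        for c in range(len(LABELS))
    ]


def predict(context_logits, gate_scores):
    """Return the label with the highest aggregated probability."""
    blended = gated_aggregate(context_logits, gate_scores)
    return LABELS[blended.index(max(blended))]
```

With equal gate scores every context contributes equally; a much larger score for one context makes the blended distribution follow that context almost exclusively.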

Cross-media web video topic detection based on heterogeneous interactive tensor learning

A Knowledge-Based Semisupervised Hierarchical Online Topic Detection Framework

Tensor-based transductive learning for multimodality video semantic concept detection

Fusing Cross-Media for Topic Detection by Dense Keyword Groups

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

RCE-HIL

Active post-refined multimodality video semantic concept detection with tensor representation

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

A multi-modal fusion approach for measuring web video relatedness

Multimodal Topic Learning for Video Recommendation

Topic Mining on Web-Shared Videos

Web video topic discovery and tracking via bipartite graph reinforcement model

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval

Exploring Inter-Frame Correlation Analysis and Wavelet-Domain Modeling for Real-Time Caption Detection in Streaming Video

Joint Image-Text News Topic Detection and Tracking by Multimodal Topic And-Or Graph

Two Kinds of Timing Cues and Their Usage in Concept Detection in News Video

Video Captioning with Guidance of Multimodal Latent Topics

Visual–Textual Hybrid Sequence Matching for Joint Reasoning

Topic Detection in News Video and Audio

Transductive Multi-Modality Video Semantic Concept Detection with Tensor Representation