Abstract: Conditional inference on joint textual and visual clues is a multi-modal reasoning task in which textual clues provide prior permutation or external knowledge that is complementary to the visual content and pivotal to deducing the correct option. Previous methods utilizing pretrained vision-language models (VLMs) have achieved impressive performance, yet they lack multimodal context reasoning capability, especially for text-modal information. To address this issue, we propose a Multi-modal Context Reasoning approach, named ModCR. Whereas VLMs perform reasoning via cross-modal semantic alignment, ModCR treats the given abstract textual semantics and the objective image information as pre-context information and embeds them into the language model to perform context reasoning. Unlike recent vision-aided language models used in natural language processing, ModCR incorporates multi-view semantic alignment information between language and vision by introducing a learnable alignment prefix between image and text in the pretrained language model. This makes the language model well suited to such multi-modal reasoning scenarios on joint textual and visual clues. We conduct extensive experiments on two corresponding datasets, and the results show significantly improved performance (an exact gain of 4.8% on the PMR test set) compared to previous strong baselines. Code: https://github.com/YunxinLi/Multimodal-Context-Reasoning

Context-Aware Tree-Based Convolutional Neural Networks for Natural Language Inference

Natural Language Inference by Tree-Based Convolution and Heuristic Matching

Natural Language Inference Using LSTM Model with Sentence Fusion

Context-Aware Dual-Attention Network for Natural Language Inference

Recognizing Entailment and Contradiction by Tree-based Convolution

Convolutional Interaction Network for Natural Language Inference

Discriminative Neural Sentence Modeling by Tree-Based Convolution

Enhancing and Combining Sequential and Tree LSTM for Natural Language Inference

Tree-based convolution: A new architecture for sentence modeling

Transformer-Based Contextualized Language Models Joint with Neural Networks for Natural Language Inference in Vietnamese

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

LMCK: pre-trained language models enhanced with contextual knowledge for Vietnamese natural language inference

Enhanced LSTM for Natural Language Inference

Knowledge Adaptive Neural Network for Natural Language Inference

Multi-turn Inference Matching Network for Natural Language Inference

Convolutional Neural Networks over Tree Structures for Programming Language Processing

Tree-based Convolution for Sentence Modeling

Asynchronous Deep Interaction Network for Natural Language Inference

Explaining Text Matching on Neural Natural Language Inference

Natural Language Inference in Context -- Investigating Contextual Reasoning over Long Texts

ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs