Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

Ana Marasović, Chandra Bhagavatula, Jae Sung Park, Ronan Le Bras, Noah A. Smith, Yejin Choi

DOI: https://doi.org/10.48550/arXiv.2010.07526

2020-10-15

Abstract: Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just their explicit content at the pixel level, but their contextual contents at the semantic and pragmatic levels. We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.

Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to generate natural language rationales: intuitive, higher-level explanations that are easy for humans to understand. It targets complex visual reasoning tasks, namely visual commonsense reasoning, visual-textual entailment, and visual question answering. Specifically, the paper studies how to generate free-text explanations by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. This requires understanding not only the explicit content of an image at the pixel level, but also its contextual content at the semantic and pragmatic levels. The paper proposes an integrated model, Rationale^VT Transformer, which aims to improve the quality of generated rationales by combining these levels of visual understanding.
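To make the idea of combining the three levels of visual understanding more concrete, here is a minimal sketch of one possible way to do it: serialize each level as text and place it in front of the question and answer, so that a pretrained language model could continue the sequence with a rationale. This is an illustration under stated assumptions, not the paper's implementation. The `VisualContext` class, the `build_rationale_prompt` function, the field names, and the `" | "` separator are all hypothetical, and the summary above does not specify how the model actually fuses these inputs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualContext:
    """Hypothetical container for textualized visual features (not from the paper)."""
    objects: List[str] = field(default_factory=list)     # pixel level: recognized object labels
    frames: List[str] = field(default_factory=list)      # semantic level: grounded semantic frames
    inferences: List[str] = field(default_factory=list)  # pragmatic level: commonsense inferences


def build_rationale_prompt(question: str, answer: str, ctx: VisualContext) -> str:
    """Concatenate visual context, question, and answer into one text input.

    The layout and separator are assumptions made for illustration only.
    Empty levels are skipped so the prompt contains no dangling labels.
    """
    parts = []
    if ctx.objects:
        parts.append("objects: " + ", ".join(ctx.objects))
    if ctx.frames:
        parts.append("frames: " + "; ".join(ctx.frames))
    if ctx.inferences:
        parts.append("inferences: " + "; ".join(ctx.inferences))
    parts.append("question: " + question)
    parts.append("answer: " + answer)
    parts.append("rationale:")  # a language model would generate the text after this marker
    return " | ".join(parts)


example = VisualContext(
    objects=["person", "cup"],
    frames=["drinking(agent=person, item=cup)"],
    inferences=["person is thirsty"],
)
prompt = build_rationale_prompt("Why is the person holding a cup?", "To drink.", example)
print(prompt)
```

The resulting string could then be passed to any text-generation model. Treating every level as plain text keeps the sketch independent of a particular vision or language library.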
