Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

Ana Marasović,Chandra Bhagavatula,Jae Sung Park,Ronan Le Bras,Noah A. Smith,Yejin Choi
DOI: https://doi.org/10.48550/arXiv.2010.07526
2020-10-15
Abstract:Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just their explicit content at the pixel level, but their contextual contents at the semantic and pragmatic levels. We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
Computation and Language,Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to generate rationales in natural language. These rationales can provide intuitive, high - level understanding and are easy for humans to understand, especially for complex visual reasoning tasks, such as visual commonsense reasoning, visual - text entailment and visual question answering. Specifically, the paper focuses on how to generate explanations in free - text form by combining pre - trained language models with visual features such as image recognition, semantic frameworks and commonsense graphs. This involves not only the understanding of the basic content of the image (pixel - level), but also the in - depth understanding of the context content of the image (semantic - level and pragmatic - level). The paper proposes an integrated model named RATIONALEVT TRANSFORMER, aiming to improve the quality of generated explanations by combining these different levels of visual understanding.