Abstract: In this paper, we investigate the problem of abductive visual reasoning (AVR), which requires vision systems to infer the most plausible explanation for visual observations. Unlike previous work, which performs visual reasoning on static images or synthesized scenes, we exploit long-term reasoning from instructional videos that contain a wealth of detailed information about the physical world. We conceptualize two tasks for this emerging and challenging topic. The primary task, AVR, provides the initial configuration and desired goal from an instructional video, and the model is expected to identify the most plausible sequence of steps to achieve the goal. To avoid trivial solutions based on appearance information rather than reasoning, we construct a second task, AVR++, which requires the model to answer why the unselected options are less plausible. We introduce a new dataset called VideoABC, which consists of 46,354 unique steps derived from 11,827 instructional videos, formulated as 13,526 abductive reasoning questions with an average reasoning duration of 51 seconds. Through an adversarial hard hypothesis mining algorithm, non-trivial and high-quality problems are generated efficiently and effectively. To achieve human-level reasoning, we propose a Hierarchical Dual Reasoning Network (HDRNet) to capture the long-term dependencies among steps and observations. We establish a benchmark for abductive visual reasoning: our method sets the state of the art on AVR (~74%) and AVR++ (~45%), whereas humans can easily achieve over 90% accuracy on these two tasks. The large performance gap reveals the limitation of current video understanding models on temporal reasoning and leaves substantial room for future research on this challenging problem. Our dataset and code are available at https://github.com/wl-zhao/VideoABC.

From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering

Visual Causal Scene Refinement for Video Question Answering

Explore Multi-Step Reasoning in Video Question Answering

Causal Understanding For Video Question Answering

Joint Answering and Explanation for Visual Commonsense Reasoning

Instance-sequence reasoning for video question answering

Video Question Answering with Semantic Disentanglement and Reasoning

TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering Via Bi-directional Reasoning

Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering

BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind

Video Question Answering: Datasets, Algorithms and Challenges

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

IntentQA: Context-aware Video Intent Reasoning

VideoABC: A Real-World Video Dataset for Abductive Visual Reasoning

iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability

Equivariant and Invariant Grounding for Video Question Answering

Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning

From Recognition to Cognition: Visual Commonsense Reasoning

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning