Using Causal Induction in Humans to Learn and Infer Causality from Video

Amy Fire (amy.fire@ucla.edu)
Song-Chun Zhu (sczhu@stat.ucla.edu)
Center for Vision, Cognition, Learning, and Art
University of California, Los Angeles
Los Angeles, CA 90095 USA

Abstract

For both human and machine learners, it is a challenge to make high-level sense of observations by identifying causes, effects, and their connections. Once these connections are learned, the knowledge can be used to infer causes and effects where visual data might be partially hidden or ambiguous. In this paper, we present a Bayesian grammar model for human-perceived causal relationships that is learnable from video. Two experiments investigate high-level causal induction from low-level visual cues. In the first experiment, we show that a computer can apply known heuristics used for causal induction by humans to learn perceptual causal relationships. In the second experiment, we show that our learned model can represent humans’ performance in reasoning about hidden effects in video, even when the computer initially misdetects those effects.

Keywords: Perceptual causality; causal induction; statistical models.

Introduction

A man approaches a closed door. He reaches out to grasp the handle and then stands there. Is it locked? Does he not have the key? He knocks and waits, but the door remains closed. Is there no one on the other side to open it?

Watching these events unfold, humans can readily answer these questions based on their causal knowledge. One way humans can learn causal relationships is through daily observation by internally measuring co-occurrence of events (Griffiths & Tenenbaum, 2005).
Research suggests that humans use a few heuristics to determine whether a co-occurrence is causal, including:
• whether the temporal lag between cause and effect is short, and the cause precedes the effect (Carey, 2009), and
• whether agent actions are responsible for causes (Saxe, Tenenbaum, & Carey, 2005).

However, learning from daily observation is limited: many actions and effects are hidden. Our prior knowledge about causal relationships between actions and effects allows us to fill in information about the events in the scene.

Some current models represent knowledge with Bayesian networks, e.g., (Griffiths & Tenenbaum, 2005). These models, however, are disjoint from the low-level visual data that people observe. Instead, models are built using high-level annotations. In reality, agents build knowledge by observing low-level visual data, and models need to be able to deal with uncertainty in observation.

Although Bayesian networks are commonly used to represent causality (Pearl, 2009), grammar models have the expressive power to represent a greater breadth of possibilities than a single instance of a Bayesian network (Griffiths & Tenenbaum, 2007). Grammar models allow for multiple configurations and high-level structures, making them more suitable for applications grounded on visual cues; Bayesian networks lack the representative power needed for this.

Grammar models are represented graphically in the And-Or Graph (AOG). In the AOG, Or-nodes represent the multiple alternatives, and And-nodes represent hierarchical decompositions. The AOG naturally lends itself to represent causation where multiple alternative causes can lead to an effect, and each cause is composed of conditions necessary for the effect.

In this paper, we introduce a grammar model for representing causal relationships between actions and object-status changes, the Causal And-Or Graph (C-AOG).
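
The And-Or structure described above can be illustrated with a small, deterministic sketch. This is not the paper's C-AOG, which is a Bayesian grammar model learned from video; it only shows how an Or-node can hold alternative causes while each And-node groups the conditions one cause requires. The class names (`AndNode`, `OrNode`), the method `explaining_causes`, and the door-related labels are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class AndNode:
    """One candidate cause: all of its conditions must hold together."""
    conditions: FrozenSet[str]

    def satisfied(self, observed: Set[str]) -> bool:
        # Subset test: every required condition was observed.
        return self.conditions <= observed


@dataclass
class OrNode:
    """An effect reachable through any one of several alternative causes."""
    effect: str
    causes: List[AndNode] = field(default_factory=list)

    def explaining_causes(self, observed: Set[str]) -> List[AndNode]:
        # Keep only the alternatives whose conditions are all observed.
        return [cause for cause in self.causes if cause.satisfied(observed)]


# Hypothetical labels, loosely based on the door scenario in the Introduction.
door_opens = OrNode(
    effect="door_opens",
    causes=[
        AndNode(frozenset({"agent_has_key", "agent_unlocks_door"})),
        AndNode(frozenset({"agent_knocks", "someone_inside_opens"})),
    ],
)

observed = {"agent_knocks", "someone_inside_opens"}
matches = door_opens.explaining_causes(observed)
print([sorted(cause.conditions) for cause in matches])
```

With these observations only the second alternative is satisfied. A probabilistic version would attach weights to the Or-node branches rather than returning a plain list, but that is beyond what this sketch assumes.
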
We describe methods for learning the model by using co-occurrence to identify potential causal relationships between events and applying the heuristics listed above to those potential relationships.

In two experiments, we investigate how the model matches human perceptions of causality. Experiment 1 uses input typical of computer vision detection systems to investigate learning the C-AOG and human perceptions of causality. Experiment 2 demonstrates that the C-AOG models human judgments on imputing hidden variables from video.

A Grammar Model for Causality

In this section, we introduce the Causal And-Or Graph for causal reasoning, which ties agent actions to fluents.

Fluents and Actions

Specifically defining those object statuses that vary over time, the term fluents comes from the commonsense-reasoning literature (Mueller, 2006). Relevant here are two kinds of fluents that intentional agents can change: object fluents (e.g., a light can be on or off) and fluents of the mind (e.g., an agent can be thirsty or not thirsty).

The values of these fluents change as a result of agent actions and also trigger rational agents to take action. A lack of change-inducing action (also known as the inertial action) causes the fluent to maintain its value; for example, a door that is closed will remain closed until some action changes it. In this work, fluents are modeled discriminatively.

Actions (A_i) are modeled using the Temporal And-Or Graph (T-AOG), a grammar model for actions (Pei, Jia, & Zhu, 2011). In the T-AOG, And-nodes group the necessary ways for an action to be performed that allow detection of the action (e.g., object/agent spatial relations, agent poses, scene contexts, and temporal relationships), and Or-nodes provide
