Abstract: Profiting from advances in deep convolutional networks, current state-of-the-art video action recognition models have achieved remarkable progress. Nevertheless, most existing models suffer from low interpretability of the predicted actions. Inspired by the observation that temporally-configured human-object interactions often serve as a key indicator of many actions, this work crafts an action reasoning framework that performs Markov Logic Network (MLN) based probabilistic logical inference. Crucially, we propose to encode an action by first-order logical rules that correspond to the temporal changes of visual relationships in videos. The main contributions of this work are two-fold: 1) Unlike existing black-box models, the proposed model simultaneously localizes temporal boundaries and recognizes action categories by grounding the logical rules of the MLN in videos. The weight associated with each such rule further provides an estimate of confidence. Together, these make our model more explainable and robust. 2) Instead of using hand-crafted logical rules as in conventional MLNs, we develop a data-driven instantiation of the MLN. Specifically, a hybrid learning scheme is proposed. It combines the MLN's weight learning with reinforcement learning, using the former's results as a self-critic for guiding the latter's training. Additionally, by treating actions as logical predicates, the proposed framework can also be integrated with deep models for a further performance boost. Comprehensive experiments on two complex video action datasets (Charades and CAD-120) clearly demonstrate the effectiveness and explainability of our proposed method.
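
To make the idea of weighted logical rules over temporal changes of visual relationships more concrete, the following is a minimal sketch of generic MLN inference on a toy example. It is not the authors' implementation: the rules, predicate strings, weights, and the `PRIOR_WEIGHT` term are all hypothetical, and marginals are computed by brute-force enumeration rather than by any method described in the paper. Each rule reads "relation `pre` holds before AND relation `post` holds after IMPLIES action", and a world's unnormalized probability is the exponential of the summed weights of its satisfied formulas.

```python
import math
from itertools import product

# Hypothetical weighted rules: (weight, relation before, relation after, action).
# Each encodes the formula: before(pre) AND after(post) => action.
RULES = [
    (2.0, "hold(person,cup)", "on(cup,table)", "put_down"),
    (1.5, "on(cup,table)", "hold(person,cup)", "pick_up"),
]
# Hypothetical weight of the unit formula "NOT action", acting as a prior
# so that actions are not predicted without supporting evidence.
PRIOR_WEIGHT = 1.0
ACTIONS = sorted({rule[3] for rule in RULES})


def world_score(world, before, after):
    """Sum the weights of all formulas satisfied in this world."""
    total = 0.0
    for weight, pre, post, action in RULES:
        body = pre in before and post in after
        # An implication is satisfied when its body is false or its head is true.
        if (not body) or world[action]:
            total += weight
    for action in ACTIONS:
        if not world[action]:
            total += PRIOR_WEIGHT
    return total


def action_marginals(before, after):
    """Return P(action is true | evidence) by enumerating all worlds."""
    z = 0.0
    mass = {a: 0.0 for a in ACTIONS}
    for values in product([False, True], repeat=len(ACTIONS)):
        world = dict(zip(ACTIONS, values))
        p = math.exp(world_score(world, before, after))
        z += p
        for a in ACTIONS:
            if world[a]:
                mass[a] += p
    return {a: m / z for a, m in mass.items()}


if __name__ == "__main__":
    # Evidence: the person holds the cup first, then the cup is on the table.
    marginals = action_marginals({"hold(person,cup)"}, {"on(cup,table)"})
    for name, prob in marginals.items():
        print(f"{name}: {prob:.3f}")
```

In this toy setup, the rule whose body matches the observed change of relationships raises the probability of its action above 0.5, while the unmatched action stays below 0.5 because of the prior formula. The rule weight plays the role of the confidence estimate mentioned in the abstract.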

Learning Grammar of Complex Activities via Deep Neural Networks

A Hybrid Graph Network for Complex Activity Detection in Video

Learning a Grammar Inducer from Massive Uncurated Instructional Videos

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions

A General Framework of Learning Multi-Vehicle Interaction Patterns from Videos

Unsupervised Learning of Event AND-OR Grammar and Semantics from Video

Learning Human Activities and Object Affordances from RGB-D Videos

Unsupervised Learning and Segmentation of Complex Activities from Video

Activity Grammars for Temporal Action Segmentation

Long Activity Video Understanding Using Functional Object-Oriented Network

Video as the New Language for Real-World Decision Making

Pattern Theory-Based Interpretation of Activities

A Grammatical Compositional Model for Video Action Detection

An Extended Grammar System for Learning and Recognizing Complex Visual Events

Compositional Learning of Human Activities With a Self-Organizing Neural Architecture

A Deep Understanding Video Q&A System for Film Education in Acting Department

Grammarization-Based Grasping with Deep Multi-Autoencoder Latent Space Exploration by Reinforcement Learning Agent

Complex Video Action Reasoning Via Learnable Markov Logic Network

Video Action Understanding