Abstract: Profiting from advances in deep convolutional networks, current state-of-the-art video action recognition models have achieved remarkable progress. Nevertheless, most existing models suffer from low interpretability of the predicted actions. Inspired by the observation that temporally-configured human-object interactions often serve as a key indicator of many actions, this work crafts an action reasoning framework that performs Markov Logic Network (MLN) based probabilistic logical inference. Crucially, we propose to encode an action by first-order logical rules that correspond to the temporal changes of visual relationships in videos. The main contributions of this work are two-fold: 1) Unlike existing black-box models, the proposed model simultaneously localizes temporal boundaries and recognizes action categories by grounding the logical rules of the MLN in videos. The weight associated with each such rule further provides an estimate of confidence. Together, these make our model more explainable and robust. 2) Instead of using hand-crafted logical rules as in conventional MLNs, we develop a data-driven instantiation of the MLN. Specifically, a hybrid learning scheme is proposed. It combines the MLN's weight learning with reinforcement learning, using the former's results as a self-critic for guiding the latter's training. Additionally, by treating actions as logical predicates, the proposed framework can also be integrated with deep models for a further performance boost. Comprehensive experiments on two complex video action datasets (Charades & CAD-120) clearly demonstrate the effectiveness and explainability of our proposed method.
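
To make the idea of weighted logical rules concrete, the following is a minimal sketch of generic MLN inference on a toy example, not the learned rules or inference procedure of this work. The atom names, the two rules, and their weights are all hypothetical and chosen only for illustration. It uses the standard MLN definition, in which a world's unnormalized probability is the exponential of the summed weights of its satisfied ground rules, and answers a query by brute-force enumeration.

```python
import math
from itertools import product

# Hypothetical ground atoms for one video segment (illustrative names only).
ATOMS = ["holding_before", "holding_after", "action_put_down"]

# Hypothetical weighted rules: (weight, predicate over a world).
# An implication A => B is written as (not A) or B.
RULES = [
    # holding(before) AND NOT holding(after) => put_down
    (2.0, lambda w: not (w["holding_before"] and not w["holding_after"])
                    or w["action_put_down"]),
    # put_down => holding(before)
    (1.0, lambda w: (not w["action_put_down"]) or w["holding_before"]),
]


def score(world):
    """Unnormalized MLN weight: exp(sum of weights of satisfied rules)."""
    return math.exp(sum(wt for wt, rule in RULES if rule(world)))


def query(atom, evidence):
    """P(atom = True | evidence), enumerating all unobserved atoms."""
    hidden = [a for a in ATOMS if a not in evidence]
    num = den = 0.0
    for values in product([False, True], repeat=len(hidden)):
        world = dict(evidence, **dict(zip(hidden, values)))
        s = score(world)
        den += s
        if world[atom]:
            num += s
    return num / den


# The object is held before the segment and not after it.
p_released = query("action_put_down",
                   {"holding_before": True, "holding_after": False})
# The object is never held.
p_never_held = query("action_put_down",
                     {"holding_before": False, "holding_after": False})
print(round(p_released, 3), round(p_never_held, 3))
```

Under these assumed weights, the action is more probable when the observed relationship change matches the first rule, and a larger rule weight would push the posterior further from 0.5, which is the sense in which a rule's weight can be read as a confidence estimate. Exhaustive enumeration only works for a handful of atoms; practical MLN systems rely on approximate inference.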

Open Set Action Recognition via Multi-Label Evidential Learning

Evidential Deep Learning for Open Set Action Recognition

Uncertainty-Aware Dual-Evidential Learning for Weakly-Supervised Temporal Action Localization

Representing Videos As Discriminative Sub-graphs for Action Recognition

Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning

Multi-Modal Multi-Action Video Recognition

Actor-agnostic Multi-label Action Recognition with Multi-modal Query

Action Recognition with Actons

Bidirectional Uncertainty-Based Active Learning for Open Set Annotation

Towards Open Set Video Anomaly Detection

Enlarging Instance-specific and Class-specific Information for Open-set Action Recognition

Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception

A Dual-threshold Based Evidential Openmax Approach for Open Set Recognition

Semi-supervised Learning for Multi-label Video Action Detection

Complex Video Action Reasoning Via Learnable Markov Logic Network

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Localized Multiple Kernel Learning for Realistic Human Action Recognition in Videos

Multi-Task Learning of Generalizable Representations for Video Action Recognition

Action Recognition by Exploring Data Distribution and Feature Correlation

Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

Action Selection Learning for Multi-label Multi-view Action Recognition