Abstract: Early Action Prediction (EAP) in videos aims to forecast action labels from partially observed videos. It is crucial in various applications, including video surveillance, driverless cars, human-robot interaction, and patient activity monitoring. EAP becomes challenging when two actions are visually similar or when one action appears as a subpart of another, leading to interrelated actions. To address this, we propose a novel approach for the early prediction of visually similar and interrelated actions. Our method represents each high-level action as a temporal composition of sub-actions, breaking down complex actions into sequences of smaller, more basic, and distinct "local actions." Furthermore, we construct a dictionary in which each original action class serves as a key, with the corresponding values representing sequences of possible constituent local actions. The proposed method comprises a two-level classifier: a base classifier and a sequence classifier. The base classifier is trained on segmented local action classes using a 3D CNN-based architecture. In a partially observed video, segments are classified using the base classifier to obtain local action labels. The sequence of observed local action labels is then input into the sequence classifier, which predicts the high-level action class label through TF-IDF-based cosine similarity between the observed sequence and the dictionary classes. We evaluated the effectiveness of our approach on two publicly available datasets, SYSU 3D HOI and MSR Daily Activity. Our method achieved an accuracy of 82.5% on SYSU 3D HOI and 90% on MSR Daily Activity by observing just the first 40 percent of frames.
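To make the sequence-classifier step concrete, the following is a minimal sketch of TF-IDF-based cosine similarity between an observed sequence of local action labels and a dictionary of action classes. It is an illustration under stated assumptions, not the authors' implementation: the dictionary entries, the label names, the single sequence per class, the smoothed IDF formula, and the function names (`idf_weights`, `tfidf`, `cosine`, `predict`) are all hypothetical. The base classifier is assumed to have already produced the observed labels.

```python
import math
from collections import Counter

# Hypothetical dictionary: high-level action -> sequence of local actions.
# These labels are illustrative only and are not taken from the paper.
ACTION_DICT = {
    "drink": ["reach_cup", "lift_cup", "sip", "put_down_cup"],
    "pour": ["reach_cup", "lift_cup", "tilt_cup", "put_down_cup"],
    "call_phone": ["reach_phone", "lift_phone", "hold_to_ear"],
}


def idf_weights(docs):
    """Smoothed inverse document frequency per local-action token (assumed form)."""
    n = len(docs)
    df = Counter(tok for seq in docs.values() for tok in set(seq))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(seq, idf):
    """Sparse TF-IDF vector as a dict; tokens absent from the dictionary are ignored."""
    tf = Counter(tok for tok in seq if tok in idf)
    return {tok: count * idf[tok] for tok, count in tf.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict(observed, docs=ACTION_DICT):
    """Return the best-matching high-level action and all similarity scores."""
    idf = idf_weights(docs)
    query = tfidf(observed, idf)
    scores = {label: cosine(query, tfidf(seq, idf)) for label, seq in docs.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Labels from the observed prefix of a video (assumed base-classifier output).
    label, scores = predict(["reach_cup", "lift_cup", "sip"])
    print(label, scores)
```

In this sketch the rarer local action (`sip`) receives a higher IDF weight than the shared ones (`reach_cup`, `lift_cup`), which is what separates two classes that begin identically. Note that a plain bag-of-labels TF-IDF ignores ordering; how the original method accounts for order or for multiple candidate sequences per class is not specified in the abstract.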

Learning to Visually Connect Actions and their Effects

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

On the Efficacy of Text-Based Input Modalities for Action Anticipation

Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task

ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

Harnessing Temporal Causality for Advanced Temporal Action Detection

Alignment-guided Temporal Attention for Video Action Recognition

Learning Perceptual Causality from Video

Enhancing early action prediction in videos through temporal composition of sub-actions

Video Action Understanding

CAST: Cross-Attention in Space and Time for Video Action Recognition

Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Object-centric Video Representation for Long-term Action Anticipation

Selective Visual Representations Improve Convergence and Generalization for Embodied AI

Interaction Region Visual Transformer for Egocentric Action Anticipation

AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding

Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries

Learning to Select: A Fully Attentive Approach for Novel Object Captioning

Concept Parser with Multimodal Graph Learning for Video Captioning

Exploring Explainability in Video Action Recognition