Abstract: Predicting the unknown from the first-person perspective is regarded as a necessary step toward machine intelligence and is essential for practical applications such as autonomous driving and robotics. As a human-level task, egocentric action anticipation aims to predict an unknown action from the first-person viewpoint seconds before it is performed. Egocentric actions are usually represented as verb-noun pairs; however, the training data rarely cover all possible combinations, which limits the prediction of unseen actions. It is therefore crucial for intelligent systems to use a limited set of known verb-noun pairs to predict new combinations that have never appeared, an ability known as compositional generalization. In this article, we are the first to explore the egocentric compositional action anticipation problem, which is closer to real-world settings but has been neglected by existing studies. Because prediction results are prone to semantic bias given the distinct difference between training and test distributions, we further introduce a general and flexible adaptive semantic debiasing framework that is compatible with different deep neural networks. To capture and mitigate semantic bias, we imagine a counterfactual situation in which no visual representations have been observed and only the semantic patterns of the observation are used to predict the next action. Instead of the traditional counterfactual analysis scheme, which reduces semantic bias indiscriminately, we devise a novel counterfactual analysis scheme that adaptively amplifies or penalizes the effect of semantic experience by considering the discrepancy both among categories and among examples. We also demonstrate that the traditional counterfactual analysis scheme is a special case of the devised adaptive scheme. We conduct experiments on three large-scale egocentric video datasets. Experimental results verify the superiority and effectiveness of our proposed solution.
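
The counterfactual idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the function names, the toy scores, the `class_prior` values, and in particular the rule inside `adaptive_weights` are hypothetical and are not the authors' method. The sketch only shows the general shape: a factual score from the full model, a semantic-only score from the counterfactual branch with no visual input, and a weight that decides how much of the semantic-only score is removed or added back.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def debias(factual, semantic_only, weights):
    """Remove a weighted semantic-only score from the factual score per class.

    A positive weight subtracts the semantic-only score for that class;
    a negative weight adds it back.
    """
    return [f - w * s for f, s, w in zip(factual, semantic_only, weights)]


def constant_weights(num_classes, alpha=1.0):
    """Traditional scheme in this sketch: one fixed weight for every class."""
    return [alpha] * num_classes


def adaptive_weights(semantic_only, class_prior, scale=1.0):
    """Hypothetical adaptive rule (an assumption, not the paper's formula).

    Category term: classes more frequent than uniform get a positive weight,
    rarer classes get a negative one.
    Example term: the whole vector is scaled by how confident the
    semantic-only prediction is for this particular example.
    """
    num_classes = len(class_prior)
    confidence = max(softmax(semantic_only))
    return [scale * confidence * (p * num_classes - 1.0) for p in class_prior]


# Toy scores for three action classes (made-up numbers).
factual = [2.0, 1.0, 0.5]        # visual + semantic branch
semantic_only = [1.5, 0.2, 0.1]  # counterfactual branch: no visual input
class_prior = [0.6, 0.3, 0.1]    # assumed training frequency of each class

traditional = debias(factual, semantic_only, constant_weights(3))
adaptive = debias(factual, semantic_only,
                  adaptive_weights(semantic_only, class_prior))
```

In this sketch, passing the same weight for every class and every example reduces `debias` to the constant-weight variant, which mirrors the abstract's statement that the traditional scheme is a special case of the adaptive one. How the weights are actually computed in the proposed framework is not stated in the abstract.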

An Egocentric Action Anticipation Framework via Fusing Intuition and Analysis

Learning to Anticipate Egocentric Actions by Imagination

VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation

Intention Action Anticipation Model with Guide-Feedback Loop Mechanism

Intention-Conditioned Long-Term Human Egocentric Action Forecasting

Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing

What if We Could Not See? Counterfactual Analysis for Egocentric Action Anticipation

From Recognition to Prediction: Leveraging Sequence Reasoning for Action Anticipation

Untrimmed Action Anticipation

Anticipation and next action forecasting in video: an end-to-end model with memory

Streaming egocentric action anticipation: An evaluation scheme and approach

ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

Object-centric Video Representation for Long-term Action Anticipation

Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention

Interaction Region Visual Transformer for Egocentric Action Anticipation

Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention

Anticipating Next Active Objects for Egocentric Videos

Delving into 3D Action Anticipation from Streaming Videos