Abstract: In this paper, we focus on egocentric action anticipation from videos, which enables applications such as helping intelligent wearable assistants understand users' needs and enhancing their capabilities during interaction. The task requires an intelligent system to observe from the first-person perspective and predict an action before it occurs. Owing to the uncertainty of the future, visual information alone is insufficient for action anticipation, especially when there is a salient visual difference between past and future. To alleviate this problem, which we call the visual gap, we propose a novel Intuition-Analysis Integrated (IAI) framework inspired by psychological research. It consists of three main parts: an Intuition-based Prediction Network (IPN), an Analysis-based Prediction Network (APN) and an Adaptive Fusion Network (AFN). To imitate the implicit intuitive thinking process, we model IPN as an encoder-decoder structure and introduce a procedural instruction learning strategy implemented by textual pre-training. To imitate explicit analytical thinking, APN instead processes information under designed rules in three steps: recognition, transition and combination. Both the procedural instruction learning strategy in IPN and the transition step of APN are crucial to improving anticipation performance by mitigating the visual gap problem. Considering the complementarity of intuition and analysis, AFN adopts attention fusion to adaptively integrate the predictions from IPN and APN and produce the final anticipation results. We conduct experiments on the largest egocentric video dataset. Qualitative and quantitative evaluation results validate the effectiveness of our IAI framework and demonstrate the advantage of bridging the visual gap with multi-modal information, including both visual features of observed segments and sequential instructions of actions.
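As a rough illustration of how the pieces described in the abstract could fit together, the toy sketch below chains an APN-style recognition, transition and combination step and then fuses its output with an IPN-style prediction. It is not the authors' implementation: the function names (`apn_predict`, `afn_fuse`), the two-action vocabulary, the hand-written transition table and the scalar branch scores are all assumptions made for this example, and in the actual framework the attention weights would be produced by a learned network rather than passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apn_predict(recognized, transition):
    """Hypothetical APN-style rule: combine recognition and transition.

    recognized[i]    : probability that the observed action is i (recognition)
    transition[i][j] : assumed probability that action j follows action i
    Returns the marginal distribution over the next action (combination).
    """
    n_next = len(transition[0])
    return [
        sum(recognized[i] * transition[i][j] for i in range(len(recognized)))
        for j in range(n_next)
    ]


def afn_fuse(ipn_probs, apn_probs, branch_scores):
    """Hypothetical AFN-style fusion: attention weights over the two branches.

    branch_scores are assumed scalar relevance scores for (IPN, APN);
    here they are supplied directly instead of being learned.
    """
    w_ipn, w_apn = softmax(branch_scores)
    return [w_ipn * p + w_apn * q for p, q in zip(ipn_probs, apn_probs)]


# Toy vocabulary (assumed): index 0 = "open fridge", index 1 = "take milk".
recognized = [0.9, 0.1]                  # recognition of the observed segment
transition = [[0.1, 0.9], [0.5, 0.5]]    # made-up transition table
apn_probs = apn_predict(recognized, transition)

ipn_probs = [0.4, 0.6]                   # stand-in for the IPN decoder output
fused = afn_fuse(ipn_probs, apn_probs, branch_scores=[0.0, 0.0])
print(apn_probs, fused)
```

With equal branch scores the fusion reduces to a plain average of the two predictions. Unequal scores shift the result toward whichever branch is judged more reliable for the given input, which is the adaptive behaviour the abstract attributes to AFN.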

Anticipating Next Active Objects for Egocentric Videos

Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention

Object-centric Video Representation for Long-term Action Anticipation

StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation

Interaction Region Visual Transformer for Egocentric Action Anticipation

Short-term Object Interaction Anticipation with Disentangled Object Detection @ Ego4D Short Term Object Interaction Anticipation Challenge

Object Aware Egocentric Online Action Detection

Anticipating Object State Changes in Long Procedural Videos

Graphing the Future: Activity and Next Active Object Prediction using Graph-based Activity Representations

Intention-Conditioned Long-Term Human Egocentric Action Forecasting

Learning to Anticipate Egocentric Actions by Imagination

An Egocentric Action Anticipation Framework via Fusing Intuition and Analysis

Egocentric Prediction of Action Target in 3D

What Will I Do Next? The Intention from Motion Experiment

Deep Attention Network for Egocentric Action Recognition

FIction: 4D Future Interaction Prediction from Video

Objects do not disappear: Video object detection by single-frame object location anticipation

Streaming egocentric action anticipation: An evaluation scheme and approach

Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos

AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation