Abstract: This paper digs deeper into the factors that influence egocentric gaze. Instead of blindly training deep models for this purpose, we propose to inspect the factors that contribute to gaze guidance during daily tasks. Bottom-up saliency and optical flow are assessed against strong spatial prior baselines. Task-specific cues such as the vanishing point, the manipulation point, and hand regions are analyzed as representatives of top-down information. We also examine the contribution of these factors by investigating a simple recurrent neural model for egocentric gaze prediction. First, deep features are extracted for all input video frames. Then, a gated recurrent unit is employed to integrate information over time and to predict the next fixation. We also propose an integrated model that combines the recurrent model with several top-down and bottom-up cues. Extensive experiments over multiple datasets reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up saliency models perform poorly in predicting gaze and underperform spatial biases, (3) deep features outperform traditional features, (4) unlike hand regions, the manipulation point is a strongly influential cue for gaze prediction, (5) combining the proposed recurrent model with bottom-up cues, vanishing points, and, in particular, the manipulation point results in the best gaze prediction accuracy over egocentric videos, (6) knowledge transfer works best when the tasks or sequences are similar, and (7) task and activity recognition can benefit from gaze prediction. Our findings suggest that (1) there should be more emphasis on hand-object interaction and (2) the egocentric vision community should consider larger datasets that include diverse stimuli and more subjects.
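
The recurrent pipeline described in the abstract (per-frame deep features integrated by a gated recurrent unit to predict the next fixation) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the class name `GRUGazePredictor`, the feature and hidden dimensions, the random untrained weights, the random vectors standing in for deep features, and the choice to output a single fixation as normalized (x, y) image coordinates.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUGazePredictor:
    """Toy GRU mapping a sequence of frame features to one 2-D fixation.

    Hypothetical illustration only: weights are random and untrained.
    """

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = feat_dim + hidden_dim
        # One weight matrix per gate, acting on [features, hidden state].
        self.W_z = rng.normal(0.0, 0.1, (hidden_dim, in_dim))  # update gate
        self.W_r = rng.normal(0.0, 0.1, (hidden_dim, in_dim))  # reset gate
        self.W_h = rng.normal(0.0, 0.1, (hidden_dim, in_dim))  # candidate state
        self.b_z = np.zeros(hidden_dim)
        self.b_r = np.zeros(hidden_dim)
        self.b_h = np.zeros(hidden_dim)
        # Linear readout from the hidden state to (x, y).
        self.W_out = rng.normal(0.0, 0.1, (2, hidden_dim))
        self.b_out = np.zeros(2)
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        """One GRU update for a single frame's feature vector x."""
        xh = np.concatenate([x, h])
        z = sigmoid(self.W_z @ xh + self.b_z)
        r = sigmoid(self.W_r @ xh + self.b_r)
        h_tilde = np.tanh(self.W_h @ np.concatenate([x, r * h]) + self.b_h)
        return (1.0 - z) * h + z * h_tilde

    def predict(self, features):
        """Integrate features over time, then read out the next fixation."""
        h = np.zeros(self.hidden_dim)
        for x in features:
            h = self.step(x, h)
        # Sigmoid keeps the output in (0, 1), i.e. normalized image coordinates.
        return sigmoid(self.W_out @ h + self.b_out)


# Random vectors stand in for deep features of 8 consecutive frames.
features = np.random.default_rng(1).normal(size=(8, 16))
model = GRUGazePredictor(feat_dim=16, hidden_dim=32)
fixation = model.predict(features)
print(fixation)
```

In a real setting the feature vectors would come from a pretrained network and the weights would be learned from recorded fixations; the sketch only shows how a hidden state carries information across frames before the readout.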

Anticipating Where People will Look Using Adversarial Networks

Generative Adversarial Network for Future Hand Segmentation from Egocentric Video

Dual Motion GAN for Future-Flow Embedded Video Prediction

DGaze: CNN-Based Gaze Prediction in Dynamic Scenes

GazeMotion: Gaze-guided Human Motion Forecasting

Streaming egocentric action anticipation: An evaluation scheme and approach

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

3D Human motion anticipation and classification

Looking-Ahead: Neural Future Video Frame Prediction

Learning to Anticipate Egocentric Actions by Imagination

Video Frame Prediction by Deep Multi-Branch Mask Network

Predicting Diverse Future Frames with Local Transformation-Guided Masking

Digging Deeper into Egocentric Gaze Prediction

A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings

Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention

3DGazeNet: Generalizing Gaze Estimation with Weak-Supervision from Synthetic Views

FIction: 4D Future Interaction Prediction from Video

A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction

FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations

Looking Ahead: Anticipating Pedestrians Crossing with Future Frames Prediction

Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention