Abstract: Eye movements have long been studied as a window into the attentional mechanisms of the human brain and made accessible as novelty-style human-machine interfaces. However, not everything that we gaze upon is something we want to interact with; this is known as the Midas Touch problem for gaze interfaces. To overcome the Midas Touch problem, present interfaces tend not to rely on natural gaze cues, but rather use dwell time or gaze gestures. Here we present an entirely data-driven approach to decode human intention for object manipulation tasks based solely on natural gaze cues. We run data collection experiments where 16 participants are given manipulation and inspection tasks to be performed on various objects on a table in front of them. The subjects' eye movements are recorded using wearable eye-trackers, allowing the participants to freely move their head and gaze upon the scene. We use our Semantic Fovea, a convolutional neural network model, to obtain the objects in the scene and their relation to gaze traces at every frame. We then evaluate the data and examine several ways to model the classification task for intention prediction. Our evaluation shows that intention prediction is not a naive result of the data, but rather relies on non-linear temporal processing of gaze cues. We model the task as a time series classification problem and design a bidirectional Long Short-Term Memory (LSTM) network architecture to decode intentions. Our results show that we can decode human intention of motion purely from natural gaze cues and object relative position, with $91.9\%$ accuracy. Our work demonstrates the feasibility of natural gaze as a Zero-UI interface for human-machine interaction, i.e., users will only need to act naturally, and do not need to interact with the interface itself or deviate from their natural eye movement patterns.
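The abstract states that the classifier input combines natural gaze cues with object relative position at every frame, treated as a time series. As a minimal sketch of what such an input could look like, the Python below turns per-frame gaze points and detected object boxes into a fixed-width feature sequence. The `Box` type, the `frame_features` and `build_sequence` helpers, the object labels, and the choice of features (offset of each object centre from the gaze point plus an "inside the box" flag) are all illustrative assumptions; the abstract does not specify the actual feature set or the Semantic Fovea output format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Box:
    """Axis-aligned bounding box in image pixels (hypothetical detector output)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2.0, (self.y_min + self.y_max) / 2.0)

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def frame_features(
    gaze: Tuple[float, float],
    boxes: Dict[str, Box],
    labels: List[str],
) -> List[float]:
    """Return (dx, dy, inside) per label, in the fixed order given by `labels`.

    dx, dy: offset of the object centre from the gaze point.
    inside: 1.0 if the gaze point falls within the object's box, else 0.0.
    Objects not detected in this frame contribute (0.0, 0.0, 0.0).
    """
    gx, gy = gaze
    feats: List[float] = []
    for label in labels:
        box: Optional[Box] = boxes.get(label)
        if box is None:
            feats.extend([0.0, 0.0, 0.0])
            continue
        cx, cy = box.center()
        feats.extend([cx - gx, cy - gy, 1.0 if box.contains(gx, gy) else 0.0])
    return feats


def build_sequence(
    frames: List[Tuple[Tuple[float, float], Dict[str, Box]]],
    labels: List[str],
) -> List[List[float]]:
    """Stack per-frame features into a time series of shape (frames, 3 * len(labels))."""
    return [frame_features(gaze, boxes, labels) for gaze, boxes in frames]


# Two illustrative frames with made-up coordinates.
labels = ["cup", "phone"]
frames = [
    ((100.0, 100.0), {"cup": Box(90, 90, 130, 130)}),
    ((200.0, 120.0), {"cup": Box(90, 90, 130, 130), "phone": Box(180, 100, 240, 140)}),
]
sequence = build_sequence(frames, labels)
```

A sequence of this shape is the kind of input a bidirectional recurrent classifier would consume, one feature vector per time step. Keeping the label order fixed means each column always refers to the same object, even in frames where that object is not detected.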

A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

End-to-End Human-Gaze-Target Detection with Transformers

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

Gaze Estimation using Transformer

Estimation of Gaze-Following Based on Transformer and the Guiding Offset

Sharingan: A Transformer-based Architecture for Gaze Following

Pose2Gaze: Eye-body Coordination during Daily Activities for Gaze Prediction from Full-body Poses

Event-based Vision for Early Prediction of Manipulation Actions

Digging Deeper into Egocentric Gaze Prediction

EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement Learning

Human-Object Interaction Prediction in Videos through Gaze Following

GIMO: Gaze-Informed Human Motion Prediction in Context

DGaze: CNN-Based Gaze Prediction in Dynamic Scenes

Gaze Target Estimation inspired by Interactive Attention

Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention

Cascaded Learning with Transformer for Simultaneous Eye Landmark, Eye State and Gaze Estimation

Appearance-based gaze estimation enhanced with synthetic images using deep neural networks

MIDAS: Deep learning human action intention prediction from natural eye movement patterns

GazeMotion: Gaze-guided Human Motion Forecasting

SwinGaze: Egocentric Gaze Estimation with Video Swin Transformer