Abstract: This paper provides the first broad overview of the relation between different interpretation methods and human eye-movement behaviour across different tasks and architectures. Interpretation methods for neural networks expose the information the machine considers important, while human eye gaze has long been regarded as a proxy for human cognitive processes. Comparing the two therefore describes machine behaviour in terms of human behaviour, and minimising the difference between them may improve machine performance. We consider three types of natural language processing (NLP) tasks (sentiment analysis, relation classification and question answering), four interpretation methods (based on simple gradient, integrated gradient, input perturbation and attention) and three architectures (LSTM, CNN and Transformer). We leverage two corpora annotated with eye-gaze information: the ZuCo dataset and the MQA-RC dataset. This research poses two research questions. First, we investigate whether the saliency (importance) that an interpretation method assigns to input words conforms to that derived from human eye-gaze features. To this end, we compute a saliency distance (SD), defined as the KL-divergence between the saliency distribution over input words and an eye-gaze feature. We find that the SD scores vary depending on the combination of task, interpretation method and architecture. Second, we investigate whether models whose saliency conforms well to human eye-gaze behaviour achieve better prediction performance. To this end, we propose a novel evaluation device called the "SD-performance curve" (SDPC), which represents the cumulative model performance against the SD scores. SDPC enables us to analyse underlying phenomena that are overlooked when using only the macroscopic metrics, such as average SD scores and rank correlations, typically used in past studies. We observe that the impact of good saliency conformity between humans and machines on task performance varies among the combinations of tasks, interpretation methods and architectures. Our findings should be considered when introducing eye-gaze information into model training to improve model performance.
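The two quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions the abstract does not settle: the direction of the KL-divergence (here, gaze relative to saliency), the normalisation of raw scores into distributions (absolute values plus a small smoothing constant `eps`), and the reading of "cumulative performance against SD" as running accuracy over examples sorted by increasing SD. The function names `to_distribution`, `saliency_distance` and `sd_performance_curve` are hypothetical.

```python
import math


def to_distribution(scores, eps=1e-12):
    """Turn per-word scores into a probability distribution.

    Assumption: magnitudes are used and smoothed by eps so that no
    word receives exactly zero probability.
    """
    mags = [abs(s) + eps for s in scores]
    total = sum(mags)
    return [m / total for m in mags]


def saliency_distance(saliency, gaze):
    """KL(gaze || saliency) over the words of one input.

    Assumption: the abstract does not state the direction of the
    divergence; the human gaze distribution is used as the reference.
    """
    if len(saliency) != len(gaze):
        raise ValueError("saliency and gaze must cover the same words")
    p = to_distribution(gaze)
    q = to_distribution(saliency)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def sd_performance_curve(sd_scores, correct):
    """Return (SD, cumulative accuracy) points, sorted by increasing SD.

    Assumption: 'cumulative performance' is read as the accuracy over
    all examples whose SD is at most the current one.
    """
    pairs = sorted(zip(sd_scores, correct), key=lambda pair: pair[0])
    curve = []
    hits = 0
    for n, (sd, ok) in enumerate(pairs, start=1):
        hits += int(ok)
        curve.append((sd, hits / n))
    return curve


if __name__ == "__main__":
    # Illustrative numbers only, not taken from either corpus.
    saliency = [0.1, 0.7, 0.2]      # e.g. gradient magnitudes per word
    gaze = [120.0, 340.0, 90.0]     # e.g. fixation durations per word
    print(saliency_distance(saliency, gaze))
    print(sd_performance_curve([0.3, 0.1, 0.2], [False, True, True]))
```

Under this reading, a curve that falls as SD grows would suggest that examples on which the model attends like a human are also the ones it predicts correctly, while a flat curve would suggest no such link.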

Looking deep in the eyes: Investigating interpretation methods for neural models on reading tasks using human eye-movement behaviour