Abstract: We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most of the existing methods use independent components such as off-the-shelf head detectors or have problems in establishing associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on the input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual associations between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.

What problem does this paper attempt to address?

The paper addresses two main challenges in multi-person gaze target detection:

1. **Detecting heads and gaze targets**: Identify the heads in the scene image, along with the objects or other regions that may serve as gaze targets.
2. **Associating heads and gaze targets**: Predict which head is looking at which target, that is, establish the association between heads and gaze targets.

### Specific problem description

#### 1. Detecting heads and gaze targets

- **Limitations of existing methods**: Most existing gaze target detection methods rely on independent components, such as off-the-shelf head detectors, and have difficulty establishing the association between heads and gaze targets. Many can also process only one head at a time, so a scene with several people must be processed repeatedly to find everyone's gaze target.
- **Solution**: The paper proposes GazeHTA, an end-to-end multi-person gaze target detection framework that predicts multiple head-target instances directly from the input scene image.

#### 2. Associating heads and gaze targets

- **Limitations of existing methods**: Existing methods usually rely on pre-trained object detectors when associating heads and gaze targets. This limits the target categories that can be considered and may exclude diverse objects or regions in the environment.
- **Solution**: GazeHTA addresses these problems in three ways:
  - **Using a pre-trained diffusion model**: Extract scene features that carry rich semantic understanding.
  - **Re-injecting head features**: Strengthen the head priors to improve head understanding.
  - **Learning a connection map**: Produce an explicit visual association between each head and its gaze target.

### Main contributions of the paper

- **First use of pre-trained diffusion models**: Extract rich semantic features for the gaze target detection task.
- **Head feature re-injection**: Strengthen the head priors and improve the accuracy of head detection.
- **Connection maps**: Explicitly associate heads and gaze targets and provide additional supervision.
- **Experimental verification**: Extensive experiments show that GazeHTA outperforms existing state-of-the-art gaze target detection methods on standard datasets.

### Experimental results

- **Performance comparison**: GazeHTA outperforms other methods on the GazeFollow and VideoAttentionTarget datasets, especially in the mAP metric, which reflects how accurately it identifies individuals and their associated gaze targets.
- **Performance in complex scenes**: On the VideoAttentionTarget dataset the improvement is particularly pronounced in AUC, distance, and mAP, demonstrating its effectiveness in complex multi-person scenes.

Together, these components give GazeHTA a more unified and efficient approach to multi-person gaze target detection.
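The idea of a connection map as an explicit visual association between a head and its gaze target can be illustrated with a small sketch. This is not the paper's implementation: the summary does not say how the map is constructed, so the code below assumes one plausible form, a Gaussian falloff around the line segment joining the head centre and the gaze point. The function name `connection_map`, the `sigma` parameter, and the pixel coordinates are all hypothetical.

```python
import numpy as np


def connection_map(shape, head, target, sigma=3.0):
    """Illustrative connection map (an assumption, not the paper's definition).

    Each pixel gets exp(-d^2 / (2 * sigma^2)), where d is its distance to the
    line segment from `head` to `target`. Points are (x, y) in pixels.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    pixels = np.stack([xs, ys], axis=-1).astype(float)  # shape (h, w, 2)

    a = np.asarray(head, dtype=float)
    b = np.asarray(target, dtype=float)
    ab = b - a
    denom = float(ab @ ab)

    if denom == 0.0:
        # Degenerate case: head and target coincide, so the segment is a point.
        t = np.zeros((h, w))
    else:
        # Projection parameter of each pixel onto the segment, clamped to [0, 1].
        t = np.clip(((pixels - a) @ ab) / denom, 0.0, 1.0)

    closest = a + t[..., None] * ab               # nearest point on the segment
    dist_sq = ((pixels - closest) ** 2).sum(axis=-1)
    return np.exp(-dist_sq / (2.0 * sigma ** 2))


# Hypothetical head centre and gaze point in a 64x64 map.
cmap = connection_map((64, 64), head=(10, 12), target=(50, 40))
```

The map peaks along the head-target segment and decays away from it, so a map of this kind encodes which head belongs to which target in image space rather than as a separate pairing step. With several people in a scene, one such map per head-target instance would keep the associations distinct.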

GazeHTA: End-to-end Gaze Target Detection with Head-Target Association
