GazeHTA: End-to-end Gaze Target Detection with Head-Target Association

Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, Xucong Zhang
2024-04-19
Abstract: We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most of the existing methods use independent components such as off-the-shelf head detectors or have problems in establishing associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual associations between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main challenges in multi-person gaze target detection:

1. **Detecting heads and gaze targets**: Identify the heads in a scene image, along with the objects or other regions that are potential gaze targets.
2. **Associating heads and gaze targets**: Predict which head is looking at which target, that is, establish the association between heads and gaze targets.

### Specific problem description

#### 1. Detecting heads and gaze targets

- **Limitations of existing methods**: Most existing gaze target detection methods rely on independent components, such as off-the-shelf head detectors, and have difficulty establishing the association between heads and gaze targets. Moreover, many methods process only one head at a time, so a scene with multiple individuals requires repeated processing to identify everyone's gaze targets.
- **Solution**: The paper proposes GazeHTA, an end-to-end multi-person gaze target detection framework that predicts multiple head-target instances from the input scene image alone.

#### 2. Associating heads and gaze targets

- **Limitations of existing methods**: Existing methods usually rely on pre-trained object detectors when associating heads with gaze targets. This limits the target categories that can be considered and may exclude diverse objects or regions in the environment.
- **Solution**: GazeHTA addresses these problems in three ways:
  - **Using a pre-trained diffusion model**: Extract scene features that carry rich semantic understanding.
  - **Re-injecting head features**: Strengthen the head priors to improve head understanding.
  - **Learning connection maps**: Predict a connection map that serves as an explicit visual association between each head and its gaze target.

### Main contributions of the paper

- **First use of pre-trained diffusion models**: Extract rich semantic features for the gaze target detection task.
- **Head feature re-injection**: Strengthen head priors and improve the accuracy of head detection.
- **Connection maps**: Explicitly associate heads with gaze targets and provide additional supervision.
- **Experimental verification**: Extensive experiments show that GazeHTA outperforms existing state-of-the-art gaze target detection methods on standard datasets.

### Experimental results

- **Performance comparison**: GazeHTA outperforms other methods on the GazeFollow and VideoAttentionTarget datasets, especially on the mAP metric, which reflects how accurately it identifies individuals and their associated gaze targets.
- **Performance in complex scenes**: The improvement is particularly pronounced on VideoAttentionTarget, across the AUC, distance, and mAP metrics, indicating that GazeHTA is effective in complex multi-person scenes.

Through these contributions, GazeHTA provides a more unified and efficient solution for multi-person gaze target detection.
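To make the idea of a map that explicitly links a head to its gaze target more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes, purely for illustration, that a connection map can be rendered as a soft line between a head center and a target point, and that a head can be matched to a candidate target by averaging the map values along the straight segment between them. The function names `render_connection_map`, `segment_score`, and `associate`, the `sigma` parameter, and the pixel coordinates are all hypothetical.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to the segment from a to b."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def render_connection_map(head, target, width, height, sigma=1.5):
    """Assumed rendering: a Gaussian-blurred line from head to target.

    Returns a list of rows; each value lies in (0, 1] and peaks on the line.
    """
    return [
        [
            math.exp(
                -point_segment_distance(x, y, *head, *target) ** 2
                / (2 * sigma ** 2)
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


def segment_score(conn_map, head, target, samples=20):
    """Mean connection-map value sampled along the head-to-target segment."""
    height, width = len(conn_map), len(conn_map[0])
    total = 0.0
    for i in range(samples + 1):
        t = i / samples
        x = round(head[0] + t * (target[0] - head[0]))
        y = round(head[1] + t * (target[1] - head[1]))
        # Clamp to the map bounds so off-image candidates do not crash.
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
        total += conn_map[y][x]
    return total / (samples + 1)


def associate(conn_map, head, candidate_targets):
    """Pick the candidate whose segment to the head is best supported by the map."""
    return max(candidate_targets, key=lambda tgt: segment_score(conn_map, head, tgt))


# Hypothetical example: one head at (2, 2) looking at (12, 8) on a 16x12 grid.
head = (2, 2)
conn_map = render_connection_map(head, (12, 8), width=16, height=12)
best = associate(conn_map, head, [(12, 8), (3, 11), (14, 1)])
```

In this toy setup `best` is `(12, 8)`, because only that candidate lies along the rendered line. The point of the sketch is the design idea described in the summary: an explicit spatial map tying a head to a target gives the association its own signal, instead of leaving it to be inferred from separate head and target detections.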