Abstract: Inferring object-wise human attention in 3D space from a third-person perspective (e.g., a camera) is crucial to many visual tasks and applications, including human-robot collaboration and unmanned vehicle driving. Classical human attention inference becomes challenging when the human eyes are not visible to the camera, the gaze point lies outside the field of view, or the gazed object is occluded by other objects in the 3D space. We refer to this setting as blind 3D human attention inference, which brings a new paradigm to the community. In this paper, we address these challenges by proposing a scene-behavior associated mechanism, in which both the 3D scene and the temporal behavior of the human are used to infer object-wise human attention and its transitions. Specifically, a point cloud is reconstructed and used as the spatial representation of the 3D scene, which helps handle the blind problem that arises from the perspective of a camera. On this basis, to infer attention without eye information, we propose a Sequential Skeleton Based Attention Network (S2BAN) for behavior-based attention modeling. Embedded in the scene-behavior associated mechanism, S2BAN is built on the temporal architecture of Long Short-Term Memory (LSTM). The network takes the human skeleton as the behavior representation and maps it to the attention direction frame by frame, which makes attention inference a temporally correlated problem. With S2BAN, the 3D gaze spot, and in turn the attended object, can be obtained frame by frame via intersection and segmentation on the previously reconstructed point cloud. Finally, we conduct experiments to evaluate the object-wise attention localization accuracy, the angular error of the estimated attention direction, and the subjective results. The experimental results show that the proposed method outperforms other competitors.
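
The abstract mentions two steps that a small example can make concrete: intersecting an attention direction with the point cloud to find a 3D gaze spot, and measuring the angular error of a predicted direction. The sketch below is illustrative only and is not the authors' implementation. It assumes the "intersection" can be approximated by picking the closest point along the ray that lies within a small radius of it, and that each point carries an object label from a prior segmentation. All names (`gaze_spot`, `angular_error_deg`, `max_dist`) and the sample data are hypothetical.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def angular_error_deg(pred, truth):
    """Angle in degrees between a predicted and a reference direction."""
    p, t = normalize(pred), normalize(truth)
    # Clamp the dot product so rounding errors cannot push acos out of range.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, t))))
    return math.degrees(math.acos(dot))


def gaze_spot(origin, direction, points, max_dist=0.05):
    """Index of the first point hit by the attention ray, or None.

    Assumption: a point counts as hit when its perpendicular distance
    to the ray is at most max_dist; among hits, the one closest to the
    origin along the ray wins, so occluded points behind it are ignored.
    """
    d = normalize(direction)
    best_idx, best_t = None, math.inf
    for i, p in enumerate(points):
        rel = tuple(pc - oc for pc, oc in zip(p, origin))
        t = sum(r * c for r, c in zip(rel, d))  # signed distance along the ray
        if t <= 0.0:
            continue  # point lies behind the ray origin
        perp_sq = sum(r * r for r in rel) - t * t
        if perp_sq <= max_dist ** 2 and t < best_t:
            best_idx, best_t = i, t
    return best_idx


# Hypothetical toy scene: four labeled points, head at the origin looking along +z.
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
labels = ["table", "cup", "lamp", "chair"]
hit = gaze_spot((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), points)
attended = labels[hit] if hit is not None else None
```

In this toy scene the ray passes through both the "cup" and the "table" points, and the nearer one is selected. A real pipeline would use a spatial index rather than a linear scan over the point cloud, and the direction would come from the network's per-frame output rather than a fixed vector.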

Inferring Human Attention by Learning Latent Intentions

Learning Stereoscopic Visual Attention Model for 3D Video

Where and Why Are They Looking? Jointly Inferring Human Attention and Intentions in Complex Tasks

I Understand You: Blind 3D Human Attention Inference From the Perspective of Third-Person

Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

Inferring Human Intent from Video by Sampling Hierarchical Plans

Inferring Shared Attention In Social Scene Videos

Enhanced 3D Human Pose Estimation from Videos by Using Attention-Based Neural Network with Dilated Convolutions

A Vision-Based Measure of Environmental Effects on Inferring Human Intention During Human Robot Interaction

Learning and Inferring "Dark Matter" and Predicting Human Intents and Trajectories in Videos

Spatial and Temporal Visual Attention Prediction in Videos Using Eye Movement Data

LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies

Attention-Based Variational Autoencoder Models for Human-Human Interaction Recognition via Generation

Learning Recurrent 3D Attention for Video-Based Person Re-Identification

Understanding More about Human and Machine Attention in Deep Neural Networks

3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation

Unified Spatio-Temporal Attention Models for Advanced Human Action Recognition & Detection

Inferring Human Intentions from Predicted Action Probabilities

MIDAS: Deep learning human action intention prediction from natural eye movement patterns

What Is The Chance Of Happening: A New Way To Predict Where People Look

SAL3D: a model for saliency prediction in 3D meshes