Abstract: For a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video. With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies. Capturing the position changes of humans and objects over the spatio-temporal dimension becomes more critical when their appearance features do not change significantly over time. Making full use of appearance features, spatial location, and semantic information is also key to improving video-based Human-Object Interaction recognition performance. In this paper, Spatio-Temporal Interaction Graph Parsing Networks (STIGPN) are constructed, which encode a video as a graph composed of human and object nodes. These nodes are connected by two types of relations: (i) spatial relations, which model the interactions between the human and the interacted objects within each frame; and (ii) inter-time relations, which capture the long-range dependencies between the human and the interacted objects across frames. With this graph, STIGPN learns spatio-temporal features directly from the whole video-based Human-Object Interaction scene. Multi-modal features and a multi-stream fusion strategy are used to enhance the reasoning capability of STIGPN. Two Human-Object Interaction video datasets, CAD-120 and Something-Else, are used to evaluate the proposed architectures, and STIGPN achieves state-of-the-art performance on both.
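The graph structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_st_graph`, the `(frame, index)` node encoding, the single-human assumption, and the choice to connect the human in every frame to the objects in every other frame are all assumptions made here for illustration. The abstract states only that spatial relations link the human and objects within a frame and that inter-time relations link them across frames.

```python
from itertools import product


def build_st_graph(num_frames, num_objects):
    """Build a toy spatio-temporal interaction graph (hypothetical helper).

    Nodes are (frame, index) pairs. Index 0 is the human (assumed to be a
    single person); indices 1..num_objects are the objects in that frame.
    """
    nodes = [(t, i) for t in range(num_frames) for i in range(num_objects + 1)]

    # Spatial relations: human -> each object within the same frame.
    spatial_edges = [
        ((t, 0), (t, k))
        for t in range(num_frames)
        for k in range(1, num_objects + 1)
    ]

    # Inter-time relations (assumed fully connected over time):
    # human in frame t -> each object in every other frame s != t.
    temporal_edges = [
        ((t, 0), (s, k))
        for t, s in product(range(num_frames), repeat=2)
        if t != s
        for k in range(1, num_objects + 1)
    ]
    return nodes, spatial_edges, temporal_edges


nodes, spatial, temporal = build_st_graph(num_frames=3, num_objects=2)
print(len(nodes), len(spatial), len(temporal))
```

In a learned model, each node would carry a feature vector (for example appearance, box location, or label embedding) and the two edge sets would define where messages are passed; the sketch covers only the connectivity.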

Modeling 4D Human-Object Interactions for Event and Object Recognition

Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization

Hi4D: 4D Instance Segmentation of Close Human Interaction

Explicit modeling of human-object interactions in realistic videos

4D Association Graph for Realtime Multi-person Motion Capture Using Multiple Video Cameras

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction

Kinematics-based 3D Human-Object Interaction Reconstruction from Single View

Human-Object Interaction Recognition by Modeling Context

Online Robust Action Recognition Based on a Hierarchical Model

Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition

InterTrack: Tracking Human Object Interaction without Object Templates

Recognising human interaction from videos by a discriminative model

Human Interaction Representation and Recognition Through Motion Decomposition

Detecting and Recognizing Human-Object Interactions

Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition

Video-object segmentation and 3D-trajectory estimation for monocular video sequences

Human-object Interaction Detection with Depth-Augmented Clues

BEHAVE: Dataset and Method for Tracking Human Object Interactions

Recognizing Conversational Interaction Based on 3D Human Pose

U4D: Unsupervised 4D Dynamic Scene Understanding