Learning Human-Object Interactions by Graph Parsing Neural Networks

Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, Song-Chun Zhu
DOI: https://doi.org/10.1007/978-3-030-01240-3_25
2018-01-01
Abstract: This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. We introduce the Graph Parsing Neural Network (GPNN), a framework that incorporates structural knowledge while being differentiable end-to-end. For a given scene, GPNN infers a parse graph that includes (i) the HOI graph structure, represented by an adjacency matrix, and (ii) the node labels. Within a message passing inference framework, GPNN iteratively computes the adjacency matrices and node labels. We extensively evaluate our model on three HOI detection benchmarks covering images and videos: HICO-DET, V-COCO, and CAD-120. Our approach significantly outperforms state-of-the-art methods, verifying that GPNN scales to large datasets and applies to spatial-temporal settings.
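The iterative inference described in the abstract, alternating between estimating an adjacency matrix and passing messages over it before reading out node labels, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`infer_adjacency`, `propagate`, `readout`, `parse_graph`), the sigmoid-of-dot-product edge score, the tanh update, and the linear label scoring are all assumptions chosen for brevity, standing in for whatever learned networks the paper uses.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def infer_adjacency(hidden):
    """Soft adjacency matrix: an edge weight in (0, 1) for each node pair.

    Assumption: edge strength is a sigmoid of the dot product of node states.
    """
    n = len(hidden)
    return [[0.0 if i == j else sigmoid(dot(hidden[i], hidden[j]))
             for j in range(n)] for i in range(n)]


def propagate(hidden, adjacency):
    """One message-passing step.

    Each node sums its neighbors' states weighted by the soft adjacency,
    then updates its own state with a tanh (a placeholder update rule).
    """
    n, d = len(hidden), len(hidden[0])
    new_hidden = []
    for i in range(n):
        message = [sum(adjacency[i][j] * hidden[j][k] for j in range(n))
                   for k in range(d)]
        new_hidden.append([math.tanh(hidden[i][k] + message[k])
                           for k in range(d)])
    return new_hidden


def readout(hidden, label_weights):
    """Give each node the label whose (hypothetical) weight vector scores highest."""
    return [max(range(len(label_weights)),
                key=lambda c: dot(label_weights[c], h))
            for h in hidden]


def parse_graph(features, label_weights, steps=3):
    """Alternate adjacency estimation and message passing, then label nodes."""
    hidden = [list(f) for f in features]
    adjacency = infer_adjacency(hidden)
    for _ in range(steps):
        hidden = propagate(hidden, adjacency)
        adjacency = infer_adjacency(hidden)
    return adjacency, readout(hidden, label_weights)
```

Calling `parse_graph(features, label_weights)` with one feature vector per detected human or object returns the two parts of the parse graph named in the abstract: a soft adjacency matrix and a label index per node. In a trained model every step would be a differentiable learned function, which is what allows end-to-end training.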