Heterogeneous Graph Network for Action Detection

Yisheng Zhao,Huaiyu Zhu,Ruohong Huan,Yaoqi Bao,Yun Pan
DOI: https://doi.org/10.1109/tcsvt.2024.3383477
2024-01-01
Abstract:Spatio-temporal action detection is a fundamental task that detects persons and recognizes their actions from videos. It requires reasoning about the spatial-temporal interactions between persons and their surroundings. Recently, more modalities have been found by researchers, which puts higher demands on the reasoning capability of the method, yet a method capable of holistic reasoning is still lacking. To this end, we propose a heterogeneous graph network, which aims to reason the spatial-temporal interactions among different types of nodes (video entities) and edges (inter-entity relations). Concretely, it includes spatial and temporal graphs, which are alternately updated. The spatial graph contains nodes of person appearance, person pose, object appearance, and hand interaction, and the temporal graph has person nodes at different moments. For information aggregation, we propose a person-centric heterogeneous graph reasoning algorithm, which introduces heterogeneity into the graphs through node-type-specific projections and modulated edge-type-specific representations. We find that the introduction of heterogeneity enriches the model’s ability to understand multi-modality, which facilitates better parsing of complex semantic relations in videos and potentially leads to further mining of spatial-temporal interactions between entities in the future. Experimental results on four public datasets demonstrate the superiority of our method. Code is available at https://github.com/actiondetection.
What problem does this paper attempt to address?