Com-STAL: Compositional Spatio-Temporal Action Localization

Shaomeng Wang, Rui Yan, Peng Huang, Guangzhao Dai, Yan Song, Xiangbo Shu
DOI: https://doi.org/10.1109/tcsvt.2023.3276979
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Spatio-temporal action localization aims to locate the spatial and temporal positions of actors and classify their actions. However, prior research overlooks the fact that human actions often involve novel objects in real-world scenarios. It thus neglects the variety of action-object combinations, which considerably limits the generalization of the resulting models. In this paper, we study action-object combinations by investigating their multi-modal visual information. To this end, we propose a novel compositional spatio-temporal action localization (Com-STAL) task, in which the training and test sets contain non-overlapping action-object combinations. Based on this task, we construct a compositional action localization dataset (Com-AD). Beyond that, we propose a simple yet effective framework, the Instance-Centric Interaction Network (ICIN), which reduces invalid inductive biases within the visual modality and alleviates the combined distribution bias by leveraging additional modal information. Extensive experimental results on Com-AD demonstrate the superior action localization performance of ICIN.