Spatial-Temporal Pyramid Graph Reasoning for Action Recognition

Tiantian Geng, Feng Zheng, Xiaorong Hou, Ke Lu, Guo-Jun Qi, Ling Shao
DOI: https://doi.org/10.1109/tip.2022.3196175
IF: 10.6
2022-01-01
IEEE Transactions on Image Processing
Abstract: Spatial-temporal relation reasoning is a significant yet challenging problem for video action recognition. Previous works typically apply local operations like 2D or 3D CNNs to conduct space-time interactions in video sequences, or simply capture space-time long-range relations of a single fixed scale. However, this is inadequate for obtaining a comprehensive action representation. Besides, most models treat all input frames equally for the final classification, without selecting key frames and motion-sensitive regions. This introduces irrelevant video content and hurts the performance of models. In this paper, we propose a generic Spatial-Temporal Pyramid Graph Network (STPG-Net) to adaptively capture long-range spatial-temporal relations in video sequences at multiple scales. Specifically, we design a temporal attention (TA) module and a spatial-temporal attention (STA) module to learn the contribution of each frame and each space-time region to an action at a feature level, respectively. We then apply the selected key information to build spatial-temporal pyramid graphs for long-range relation reasoning and more comprehensive action representation learning. STPG-Net can be flexibly integrated into 2D and 3D backbone networks in a plug-and-play manner. Extensive experiments show that it brings consistent improvements over many challenging baselines on several standard action recognition benchmarks (i.e., Something-Something V1 & V2, and FineGym), demonstrating the effectiveness of our approach.
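The idea of learning the contribution of each frame can be illustrated with a generic softmax attention over frames. The sketch below is not the paper's TA module, whose exact form the abstract does not give. It assumes each frame is already summarized by a spatially pooled feature vector, and the names `temporal_attention`, `frames`, and `w` are hypothetical, with `w` standing in for a learned scoring vector.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(
    frames: List[List[float]], w: List[float]
) -> Tuple[List[float], List[List[float]]]:
    """Weight T frame features (each of dimension C) by a per-frame score.

    Assumption: the score is a plain dot product with a vector `w`;
    a real module would learn `w` (or a small network) end to end.
    """
    # One scalar score per frame.
    scores = [sum(x * wi for x, wi in zip(f, w)) for f in frames]
    # Normalize the scores across the temporal axis.
    weights = softmax(scores)
    # Scale each frame's features by its attention weight.
    reweighted = [[a * x for x in f] for a, f in zip(weights, frames)]
    return weights, reweighted


# Three frames with 2-D features; the third frame scores highest.
weights, reweighted = temporal_attention(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 1.0]
)
```

Under these assumptions, frames with higher scores receive larger weights, so less relevant frames contribute less to whatever is built downstream from the reweighted features.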