Glimpse and Zoom: Spatio-Temporal Focused Dynamic Network for Skeleton-based Action Recognition

Zhifu Zhao, Ziwei Chen, Jianan Li, Xiaotian Wang, Xuemei Xie, Lei Huang, Wanxin Zhang, Guangming Shi
DOI: https://doi.org/10.1109/tcsvt.2024.3358836
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: GCN-based methods have achieved remarkable performance in skeleton-based action recognition. However, existing methods have not explicitly attempted to remove temporal and spatial redundancy, which can introduce additional computational cost. Inspired by the observation that humans tend to glimpse at the overall motion and then zoom into the most important spatio-temporal regions, we propose a Spatio-Temporal Focused Dynamic Network (STFD-Net), trained with reinforcement learning, for skeleton-based action recognition. Specifically, we first propose a global extractor with a Skeleton Pooling Module (SPM) that enables the network to focus on overall motion information with a refined skeleton structure. Then, a local extractor, containing a pair-wise part partition, a tubelet proposal network, and a Partition-Grouped Module (PGM), is proposed to extract local motion details that complement the overall motion information. Finally, a dynamic classifier uses a recurrent neural network to terminate the process dynamically once the network is adequately confident. Extensive experiments demonstrate that the proposed network achieves state-of-the-art (SOTA) level performance at lower computational cost on the NTU 60 and NTU 120 datasets.
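The dynamic termination step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed confidence threshold, and the running sum used in place of a recurrent network are all assumptions made for illustration. The abstract states only that a recurrent classifier stops once it is adequately confident and that the network is trained with reinforcement learning; it does not specify the stopping rule.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_early_exit(step_logits, threshold=0.9):
    """Hypothetical early-exit loop; returns (predicted_class, steps_used).

    step_logits: one list of class logits per processing step, e.g. the
    glimpse step first, then one entry per zoomed-in region.
    threshold: assumed confidence level at which processing stops.
    """
    step_logits = list(step_logits)
    if not step_logits:
        raise ValueError("at least one step of logits is required")

    state = None
    steps_used = 0
    for logits in step_logits:
        steps_used += 1
        # A running sum stands in for the recurrent state (assumption):
        # a real model would update a learned hidden state here.
        if state is None:
            state = list(logits)
        else:
            state = [s + x for s, x in zip(state, logits)]
        probs = softmax(state)
        confidence = max(probs)
        # Stop as soon as the accumulated evidence is confident enough,
        # so later (more expensive) steps are skipped.
        if confidence >= threshold:
            break
    return probs.index(confidence), steps_used


if __name__ == "__main__":
    # Three steps of made-up logits for a three-class problem.
    evidence = [[0.2, 0.1, 0.0], [3.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
    print(classify_with_early_exit(evidence, threshold=0.9))
```

Raising the threshold in this sketch trades computation for confidence: more steps are consumed before a prediction is returned.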