Cross-Scale Spatiotemporal Refinement Learning for Skeleton-Based Action Recognition

Yu Zhang, Zhonghua Sun, Meng Dai, Jinchao Feng, Kebin Jia
DOI: https://doi.org/10.1109/lsp.2024.3356808
2024-02-02
IEEE Signal Processing Letters
Abstract: As skeleton data becomes increasingly available, Graph Convolutional Networks (GCNs) are widely adopted to extract spatial and temporal features for skeleton-based action recognition. However, GCN-based methods still have limitations. First, multi-level semantic features are not connected, so fine-grained information is lost as the network deepens. Second, cross-scale spatiotemporal features are not simultaneously considered and refined to focus on informative areas. These limitations make it difficult to distinguish confusing actions. To address these issues, we propose a cross-scale connection (CSC) structure and a spatiotemporal refinement focus (STRF) module. The CSC aims to bridge the gap between multi-level semantic features. The STRF module refines the cross-scale spatiotemporal features to focus on informative joints in each frame. Both are embedded into standard GCNs to form the cross-scale spatiotemporal refinement network (CSR-Net). Our proposed CSR-Net explicitly models the cross-scale spatiotemporal information among multi-level semantic representations to boost the distinguishing capability for ambiguous actions. We conduct extensive experiments to demonstrate the effectiveness of the proposed method, which outperforms state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets.
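Illustrative sketch (not from the paper): the abstract does not specify how the STRF module is built, so the following is only a generic per-frame joint attention that conveys the idea of "focusing on informative joints in each frame". The function name `refine_joints`, the use of a channel mean as the joint score, the softmax over joints, and the residual scaling are all assumptions made for illustration.

```python
import math


def refine_joints(features):
    """Reweight joints within each frame by a softmax attention score.

    features[t][v] is the list of channel values for joint v in frame t.
    Returns (refined, weights): refined has the same shape as features,
    and weights[t][v] is the attention weight of joint v in frame t.
    This is a hypothetical stand-in, not the STRF module itself.
    """
    refined, weights = [], []
    for frame in features:
        # One scalar score per joint: the mean of its channel values (assumption).
        scores = [sum(joint) / len(joint) for joint in frame]
        # Numerically stable softmax over the joints of this frame.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        frame_weights = [e / total for e in exps]
        # Residual scaling keeps the original signal of every joint.
        refined.append([[c * (1.0 + w) for c in joint]
                        for joint, w in zip(frame, frame_weights)])
        weights.append(frame_weights)
    return refined, weights


# One frame, two joints, two channels: the second joint has larger activations.
refined, weights = refine_joints([[[1.0, 1.0], [3.0, 3.0]]])
```

In this toy input the second joint receives the larger weight, so its features are amplified more than those of the first joint; a learned module would replace the channel-mean score with trainable parameters.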
engineering, electrical & electronic
What problem does this paper attempt to address?