Decoupled Spatio-Temporal Grouping Transformer for Skeleton-Based Action Recognition

Shengkun Sun, Zihao Jia, Yisheng Zhu, Guangcan Liu, Zhengtao Yu
DOI: https://doi.org/10.1007/s00371-023-03132-1
2024-01-01
Abstract: Capturing correlations between joints is crucial in skeleton-based action recognition. The Transformer has demonstrated its capability to capture such correlations. However, conventional Transformer-based approaches model the relationships between joints in a unified spatio-temporal dimension, disregarding the distinct semantic information carried by the spatial and temporal dimensions of skeleton sequences. To address this issue, we present a novel decoupled spatio-temporal grouping Transformer (DSTGFormer) model. The skeleton sequence is split into multiple spatio-temporal groups, each containing a set of consecutive frames. The spatio-temporal positional encoding (STPE) module assigns identity information to each element in the sequence. The spatio-temporal grouping self-attention (STGA) module captures the spatial and temporal relationships between different joints within a spatio-temporal group. Decoupling the spatial and temporal dimensions in this way enables the extraction of the different semantic information present in each dimension. Additionally, we propose a within-group spatial global regularization mechanism to learn more general spatial attention maps, and an inter-group feature aggregation (IGFA) module to enhance the differentiation between similar actions. Our proposed method outperforms state-of-the-art methods on two large-scale datasets in terms of both recognition accuracy and computational efficiency.
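The grouping and decoupling described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the (frames, joints, channels) tensor layout, the projection-free attention, the choice to apply spatial attention before temporal attention, and the dropping of trailing frames are all assumptions made for illustration. STPE, the spatial global regularization, and IGFA are not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Scaled dot-product self-attention over the second-to-last axis.
    # Assumption: no learned query/key/value projections, to keep the sketch short.
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def split_into_groups(seq, group_len):
    # seq: (T, V, C) = frames, joints, channels (hypothetical layout).
    # Each group holds `group_len` consecutive frames; trailing frames
    # that do not fill a whole group are dropped (an assumption).
    T, V, C = seq.shape
    n_groups = T // group_len
    return seq[: n_groups * group_len].reshape(n_groups, group_len, V, C)

def decoupled_group_attention(groups):
    # groups: (n_groups, G, V, C).
    # Spatial step: within each frame, joints attend to each other.
    spatial = self_attention(groups)
    # Temporal step: for each joint, frames within the same group attend
    # to each other. Swap axes so frames sit on the attention axis.
    per_joint = np.swapaxes(spatial, 1, 2)      # (n_groups, V, G, C)
    temporal = self_attention(per_joint)
    return np.swapaxes(temporal, 1, 2)          # back to (n_groups, G, V, C)

# Example: 10 frames, 25 joints, 8 channels, groups of 4 frames.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(10, 25, 8))
groups = split_into_groups(sequence, group_len=4)
features = decoupled_group_attention(groups)
```

Because attention is restricted to one group at a time and to one axis at a time, the score matrices have size V x V and G x G rather than (T·V) x (T·V), which is one plausible reading of the efficiency claim in the abstract.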