STSD: spatial–temporal semantic decomposition transformer for skeleton-based action recognition

Hu Cui,Tessai Hayama
DOI: https://doi.org/10.1007/s00530-023-01251-2
IF: 3.9
2024-01-27
Multimedia Systems
Abstract:Skeleton-based human action recognition has attracted widespread interest, as skeleton data are extremely robust to changes in lighting, camera views, and complex backgrounds. In recent studies, transformer-based methods are proposed for the encoding of the latent information underlying the 3D skeleton. These methods focus on modeling the relationships of joints in skeleton sequences without any predefined graphical information by self-attention mechanism and have been proven to be effective. But there are two challenging issues ignored in these methods: the utilization of human body-related and dynamic semantic information. In this work, we propose a novel spatial–temporal semantic decomposition transformer network (STSD-TR) that models dependencies between joints with body parts semantics and sub-action semantics. In our STSD-TR, a body parts semantic decomposition module (BPSD) is used to extract body parts semantic information from 3D coordinates of joints, and then a temporal-local spatial–temporal attention module (TL-STA) is used to capture the relationships of joints in several consecutive frames which can be understood as local sub-action semantic information. Finally, a global spatial–temporal module (GST) is used to aggregate the temporal-local features and generate a global spatial–temporal representation. Moreover, we design a BodyParts-Mix strategy which mixes body parts from two people in a unique manner and further boosts the performance. Compared with the state-of-the-art methods, our method achieves competitive performance on two large-scale datasets.
computer science, information systems, theory & methods
What problem does this paper attempt to address?