A Spatiotemporal Fusion Network for Skeleton-Based Action Recognition

Wenxia Bao, Junyi Wang, Xianjun Yang, Hemu Chen
DOI: https://doi.org/10.1109/icipmc62364.2024.10586602
2024-01-01
Abstract: Skeleton-based action recognition has recently shown great potential in fields such as video surveillance and human-computer interaction. This paper addresses two problems: the limitations of handcrafted feature extraction in existing convolutional neural network (CNN) based methods, and the high computational and structural complexity of graph convolutional neural network (GCN) models. We propose an action recognition network that incorporates both spatial and temporal features. Specifically, we use the skeleton data as input to the joint stream and compute the difference of the skeleton data as input to the motion stream. We construct shallow and deep feature extraction modules and enhance the network’s long-term modeling capability by stacking Transformer-based feature extraction sub-modules on top of convolutional layers. The extracted features are concatenated and passed through fully connected layers to obtain the final prediction. On the NTU RGB-D dataset, the proposed network achieves classification accuracies of 91.06% under the cross-view (CV) evaluation protocol and 83.16% under the cross-subject (CS) protocol. At 1.3 GFLOPs, the network achieves a good balance between computational efficiency and classification accuracy.
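The motion-stream input described in the abstract can be illustrated with a minimal sketch. It assumes the "difference of skeleton data" means the frame-to-frame difference of joint coordinates, and that the result is zero-padded at the last frame to keep the sequence length; the abstract states neither detail. The function name `motion_stream` and the toy data are hypothetical.

```python
from typing import List

# One frame: a list of joints, each joint a list of coordinates (e.g. x, y, z).
Frame = List[List[float]]


def motion_stream(frames: List[Frame]) -> List[Frame]:
    """Return frame-to-frame differences of joint coordinates.

    Assumption (not stated in the abstract): the output is zero-padded
    at the last frame so it has the same length as the input sequence.
    """
    if not frames:
        return []
    # Difference between each frame and the one that follows it.
    motion = [
        [[c1 - c0 for c0, c1 in zip(j0, j1)] for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]
    # Zero frame appended so joint and motion streams stay aligned in time.
    motion.append([[0.0] * len(joint) for joint in frames[-1]])
    return motion


# Toy sequence: 3 frames, 2 joints, 3-D coordinates.
joints = [
    [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    [[0.5, 0.0, 0.0], [1.0, 2.0, 1.0]],
    [[1.5, 0.0, 0.0], [1.0, 2.0, 3.0]],
]
motion = motion_stream(joints)
```

Here `joints` would feed the joint stream and `motion` the motion stream; both have the same number of frames, so the two feature extractors can process them in parallel before their outputs are concatenated.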