Spatial Temporal Block Transformer Network for Skeleton-Based Action Recognition

Fan Yang, Dewei Li, Gang Wang
DOI: https://doi.org/10.1109/cac57257.2022.10055641
2022-01-01
Abstract: Recently, skeleton-based action recognition has become very popular in areas such as autonomous driving and environmental awareness due to its small data size and high accuracy. Graph convolutional networks (GCNs) can model skeleton data well by adapting to the graph structure of the skeleton. However, GCNs lack the ability to extract global spatial and temporal dependencies. Moreover, an action is formed not by a single skeleton point or frame but by several skeleton points and frames together. In this paper, a novel spatial-temporal block transformer network based on the self-attention mechanism is proposed, which efficiently models global spatial-temporal dependencies. The model takes spatial-temporal blocks as input, and spatial-temporal feature aggregation modules are applied to enhance the information exchange between blocks. Extensive experiments on two skeleton-based action recognition datasets show the remarkable performance of our model, which outperforms most GCN-based methods and all transformer-based methods.
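The abstract does not specify how blocks are formed or how attention is parameterized, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a skeleton sequence stored as nested lists of shape [frames][joints][channels], groups joints by consecutive index (a hypothetical choice), and uses plain scaled dot-product self-attention with identity query/key/value projections in place of learned weights. The names `partition_blocks` and `self_attention` are illustrative and do not come from the paper.

```python
import math


def partition_blocks(seq, t_size, v_size):
    """Split seq[T][V][C] into flattened spatial-temporal block tokens.

    Each block spans t_size consecutive frames and v_size consecutive joints.
    Grouping joints by index is an assumption made for illustration only.
    """
    T, V = len(seq), len(seq[0])
    assert T % t_size == 0 and V % v_size == 0, "block sizes must divide T and V"
    tokens = []
    for t0 in range(0, T, t_size):
        for v0 in range(0, V, v_size):
            token = []
            for t in range(t0, t0 + t_size):
                for v in range(v0, v0 + v_size):
                    token.extend(seq[t][v])  # append the C channels of one joint
            tokens.append(token)
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Every block token attends to every other block token, which is how
    global spatial-temporal dependencies can be captured in one step.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(weights[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return out


# Toy sequence: 4 frames, 6 joints, 3 coordinates per joint (made-up values).
T, V, C = 4, 6, 3
seq = [[[0.01 * (t + v + c) for c in range(C)] for v in range(V)] for t in range(T)]

tokens = partition_blocks(seq, t_size=2, v_size=3)  # (4/2) * (6/3) block tokens
attended = self_attention(tokens)                   # same shape as tokens
```

With these toy sizes, the sequence is split into four block tokens of 18 values each (2 frames x 3 joints x 3 channels), and attention returns one mixed token per block. A trained model would add learned projections, multiple heads, positional information, and the feature aggregation modules mentioned in the abstract, none of which are detailed here.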